coleygroup / del_qsar

MIT License

Question regarding the datasets -- and the exp_tot and beads_tot values. #6

Closed danielvik closed 3 months ago

danielvik commented 3 months ago

Hi,

First of all, thanks for writing this paper and making the code accessible here.

I would like to apply your method to my own datasets, and I am unsure how exactly you arrive at the exp_tot and beads_tot values in the DD1S_CAIX_QSAR.csv file.

You mention in the paper that it is a remake of the DEL-DOS-1 dataset from Gerry et al., which has 2 no-beads replicates and 4 ca9 replicates. Are the exp_tot and beads_tot values in your dataset simply a sum of these, or are the calculations more advanced?

You reference 'a custom Python script' in the Data Processing section of the paper, but from what I can see in this repo the code is not available.

Thanks, Daniel

connorcoley commented 3 months ago

Thanks!

Yes, the total counts should just be the sums of the per-replicate counts, nothing more complicated.
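To make the summation concrete, here is a minimal sketch of what "sum over replicates" could look like. The per-replicate column names (`beads_1`, `exp_1`, ...), the `compound_id` column, and the toy counts are all hypothetical; only `exp_tot` and `beads_tot` come from the dataset discussed above.

```python
import csv
import io

# Hypothetical per-replicate table: 2 no-beads replicates and 4 target replicates.
# Column names and counts are made up for illustration.
RAW = """compound_id,beads_1,beads_2,exp_1,exp_2,exp_3,exp_4
c1,3,1,10,12,9,11
c2,0,2,1,0,2,1
"""

def add_totals(row):
    """Sum the replicate columns of one row into beads_tot and exp_tot."""
    beads_tot = sum(int(v) for k, v in row.items() if k.startswith("beads_"))
    exp_tot = sum(int(v) for k, v in row.items() if k.startswith("exp_"))
    return {
        "compound_id": row["compound_id"],
        "exp_tot": exp_tot,
        "beads_tot": beads_tot,
    }

totals = [add_totals(r) for r in csv.DictReader(io.StringIO(RAW))]
for t in totals:
    print(t)
```

For your own data you would swap `io.StringIO(RAW)` for an open file handle and adjust the column prefixes to your naming scheme.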

The custom Python script isn't anything terribly fancy. It just iterates through the FASTA file and builds up a dictionary of counts, which is then reformatted into those .csv files. It reads from spreadsheets of building block (BB) and library tags to know what to look for, and it allows some errors in matching when a non-exact match can still be resolved to an expected tag unambiguously, i.e., when the Hamming distance between the observed sequence and an expected tag is less than the distance to any other expected tag. A lot of the details of this script are specific to how we had been storing our DEL metadata, so it's not terribly useful to share.
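For anyone wanting to write something similar, here is a rough sketch of the matching and counting logic described above. It is not the actual script. The fixed tag position (`start`, `length`), the `max_mismatches` cutoff, the one-sequence-per-line FASTA layout, and the toy tags are all assumptions made for illustration.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def resolve_tag(observed, expected_tags, max_mismatches=1):
    """Return the unique closest expected tag, or None if too far or ambiguous.

    max_mismatches is an assumed cutoff; the thread only says "some errors".
    """
    candidates = [t for t in expected_tags if len(t) == len(observed)]
    if not candidates:
        return None
    dists = sorted((hamming(observed, t), t) for t in candidates)
    best_dist, best_tag = dists[0]
    if best_dist > max_mismatches:
        return None
    # Ambiguous: another expected tag is equally close.
    if len(dists) > 1 and dists[1][0] == best_dist:
        return None
    return best_tag

def count_reads(fasta_lines, expected_tags, start, length):
    """Count reads per resolved tag, assuming one sequence line per record."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        tag = resolve_tag(line[start:start + length], expected_tags)
        if tag is not None:
            counts[tag] += 1
    return counts

# Toy example with hypothetical 4-nt tags at the start of each read.
tags = ["ACGT", "TTTT", "GGCC"]
fasta = [">r1", "ACGTAAAA", ">r2", "ACGAAAAA", ">r3", "TTGGAAAA", ">r4", "CCCCAAAA"]
counts = count_reads(fasta, tags, start=0, length=4)
print(dict(counts))
```

In this toy run, `r1` matches exactly, `r2` is one mismatch away from a single tag and is rescued, and `r3` and `r4` are discarded because their nearest tag is more than one mismatch away. A real DEL read would carry several tags (library tag plus one per building block cycle), so the same resolution step would be applied to each tag region and the counts keyed on the resulting tuple.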

danielvik commented 3 months ago

Awesome! 🙏 Thanks for a quick reply - and again I appreciate the work you've done here.