adriantich / DnoisE

Distance denoise by Entropy
GNU General Public License v3.0

Request for a .csv table format example #9

Closed by SanniH 3 years ago

SanniH commented 3 years ago

Hi Adria,

Thank you for your help in fixing the previous bug so promptly! I have a more general question about the use of DnoisE this time, if that's ok.

Would you be able to provide an example of what type of csv table can be used as an input file? I understand it is meant to have "sample data", but does this mean in the format of an OTU table for the dereplicated sequences? And if so, are there specific column names for sequences, and do they have to be in a specific position in the csv (e.g. the 2nd column)? Currently, my csv is an OTU table with dereplicated ESVs as rows and samples as columns, with no sequence column included yet (as I wasn't sure where I should put it). A snippet of my .csv (as copied from Excel; the separator is normally ","):

OTU ID | 100a | 101a | 101b | 101c | 102a | 102b | 102c | 103a
uniq1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
uniq10 | 0 | 1 | 7 | 3 | 3 | 0 | 1 | 0
uniq100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 185
uniq1000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

I haven't quite been able to figure out how to map the denoised reads (from fasta input) back to my original samples. I suppose I could use a list of the denoised sequence IDs to construct another OTU table after DnoisE, but I am concerned about losing detections in case there is a sample that only has the "daughter" sequence(s), and not the mother. So if the OTU table information can be included in the denoising step already, that would make the use of DnoisE a lot easier for those of us with a little less bioinformatics or command line expertise... :)

Apologies in advance if this is explained somewhere already, but I was unable to find it!

Thanks, Sanni Hintikka

adriantich commented 3 years ago

Thanks again for all your comments! In a few weeks I will post a "how to work with" example, but I need some more time. Meanwhile, I hope the following information is useful.

When working with .csv input files, the -f and -p parameters must be specified (-f F -p 2 for a comma separator, as in your example; for more info see README.md). To return a .csv file, -F must also be False (-F F). The only required columns are the id, the total read abundance and the sequence. Sample information can also be included in the dataset, but the sample columns must be adjacent and their position must be specified using the -s and -z parameters. The names of the total abundance and sequence columns are also user-settable, with 'size' and 'sequence' as the default values.
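For anyone unsure which values to pass to -s and -z, the following is a minimal sketch (not part of DnoisE) that derives them from a CSV header. It assumes the two parameters are 1-based positions of the first and last sample columns, which is what the examples in this thread suggest; the function name `sample_column_range` is hypothetical.

```python
def sample_column_range(header, sample_names):
    """Return the 1-based (first, last) positions of the sample columns.

    Assumption: -s and -z are 1-based column positions, as the examples
    in this thread suggest. Raises ValueError if a sample column is
    missing or if the sample columns are not adjacent.
    """
    positions = sorted(header.index(name) + 1 for name in sample_names)
    first, last = positions[0], positions[-1]
    if positions != list(range(first, last + 1)):
        raise ValueError("sample columns must be adjacent")
    return first, last


header = "OTU_ID,size,100a,101a,101b,101c,102a,sequence".split(",")
samples = ["100a", "101a", "101b", "101c", "102a"]
first, last = sample_column_range(header, samples)
print(f"-s {first} -z {last}")
```

The check for adjacency mirrors the requirement that the sample columns "must be together"; a header with a non-sample column in the middle of the block is rejected rather than silently mis-indexed.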

Here are a few examples:

1- No sample information and default names:
OTU_ID,size,sequence
uniq1,1,ACTAGCTAGA...
uniq10,15,AGTCAGGAT...
uniq100,185,AGGTAGGTT...
uniq1000,14,AGTCGATGCT...

$ python3 DnoisE/src/DnoisE.py -i UTILA_DSE.csv -o Utila -c 20 -f F -F F -p 2

2- Sample information and default names:
OTU_ID,size,100a,101a,101b,101c,102a,sequence
uniq1,1,0,0,0,1,0,ACTAGCTAGA...
uniq10,15,0,0,13,1,1,AGTCAGGAT...
uniq100,185,170,0,13,1,1,AGGTAGGTT...
uniq1000,14,0,13,1,0,0,AGTCGATGCT...

$ python3 DnoisE/src/DnoisE.py -i UTILA_DSE.csv -o Utila -c 20 -f F -F F -p 2 -s 3 -z 7

3- Sample information, non-default names and a different column order:
OTU_ID,reads,seq,100a,101a,101b,101c,102a
uniq1,1,ACTAGCTAGA[...],0,0,0,1,0
uniq10,15,AGTCAGGAT[...],0,0,13,1,1
uniq100,185,AGGTAGGTT[...],170,0,13,1,1
uniq1000,14,AGTCGATGCT[...],0,13,1,0,0

$ python3 DnoisE/src/DnoisE.py -i UTILA_DSE.csv -o Utila -c 20 -f F -F F -p 2 -s 4 -z 8 -n reads -q seq
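A table in the layout of example 2 can be assembled from an OTU table that has no size or sequence column yet. The sketch below is only an illustration using the Python standard library, not a DnoisE utility: the function name `build_dnoise_table`, the "OTU ID" column name and the short sequences are placeholders, and in practice the sequences would come from the dereplicated fasta file.

```python
import csv
import io


def build_dnoise_table(otu_csv_text, sequences, id_column="OTU ID"):
    """Add 'size' (row total) and 'sequence' columns to an OTU table.

    otu_csv_text: comma-separated OTU table, ids in `id_column`,
                  one column per sample.
    sequences:    dict mapping each id to its sequence.
    Returns CSV text laid out as OTU_ID,size,<samples...>,sequence.
    """
    reader = csv.DictReader(io.StringIO(otu_csv_text))
    samples = [name for name in reader.fieldnames if name != id_column]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["OTU_ID", "size", *samples, "sequence"])
    for row in reader:
        counts = [int(row[name]) for name in samples]
        seq_id = row[id_column]
        # Total abundance is the sum of the per-sample counts.
        writer.writerow([seq_id, sum(counts), *counts, sequences[seq_id]])
    return out.getvalue()


otu_table = """OTU ID,100a,101a,101b,101c,102a,102b,102c,103a
uniq10,0,1,7,3,3,0,1,0
uniq100,0,0,0,0,0,0,0,185
"""
# Placeholder sequences for illustration only.
placeholder_sequences = {"uniq10": "ACGTACGT", "uniq100": "ACGTTCGT"}
table = build_dnoise_table(otu_table, placeholder_sequences)
print(table)
```

With eight sample columns placed between size and sequence, the samples occupy columns 3 to 10, so under the position convention of the examples above the call would use -s 3 -z 10.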

Hope these examples are useful.

If you need anything else, just tell me!

Adrià Antich

SanniH commented 3 years ago

Hi,

Thanks again for your swift reply! Those examples are great, they make it much clearer :)

I just want to confirm that the output then uses both the daughter and mother reads to construct the new csv with sample information (to avoid losing detections in samples that only have daughter seqs, as mentioned before), and renames them with the mother seq id, instead of only mapping the mother sequence abundances to samples? I.e.:
Seq1 (mother seq) found in samples A and B
Seq2 (daughter of Seq1) found in sample C, renamed to Seq1 but retaining the abundance of Seq2

Does that make sense?

Cheers, SanniH

adriantich commented 3 years ago

Yes. All daughters are merged into their mothers, and both the total and the per-sample abundances are the sum over all merged sequences. In your example, Seq1 (the mother) will now have the per-sample abundances of all merged sequences, not only the mother's. For example, given:

OTU_ID,size,100a,101a,101b,101c,102a,sequence
uniq1,1,0,0,0,1,0,ACTAGCTAGA...
uniq10,15,0,0,13,1,1,AGTCAGGAT...
uniq100,185,170,0,13,1,1,AGGTAGGTT...
uniq1000,14,0,13,1,0,0,AGTCGATGCT...

If uniq1000 is a daughter of uniq100, the result will be:

OTU_ID,size,100a,101a,101b,101c,102a,sequence
uniq100,199,170,13,14,1,1,AGGTAGGTT...
uniq10,15,0,0,13,1,1,AGTCAGGAT...
uniq1,1,0,0,0,1,0,ACTAGCTAGA...

DnoisE also returns a "denoising_info" output with information on how the sequences were merged.
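The arithmetic of the merge can be reproduced with a few lines of Python. This is only a sketch of the summing step, not DnoisE's denoising algorithm: the mother/daughter assignment is given by hand here, the function name `merge_daughters` is hypothetical, and it assumes each daughter points directly at its final mother (no chains).

```python
def merge_daughters(rows, mother_of):
    """Sum per-sample counts of daughters into their mothers.

    rows:      dict mapping sequence id to a list of per-sample counts.
    mother_of: dict mapping daughter id to mother id (one level only).
    Returns a dict keyed by the surviving (mother) ids.
    """
    merged = {}
    for seq_id, counts in rows.items():
        # Sequences without a mother are kept under their own id.
        target = mother_of.get(seq_id, seq_id)
        current = merged.setdefault(target, [0] * len(counts))
        merged[target] = [a + b for a, b in zip(current, counts)]
    return merged


rows = {
    "uniq1": [0, 0, 0, 1, 0],
    "uniq10": [0, 0, 13, 1, 1],
    "uniq100": [170, 0, 13, 1, 1],
    "uniq1000": [0, 13, 1, 0, 0],
}
merged = merge_daughters(rows, {"uniq1000": "uniq100"})
for seq_id, counts in merged.items():
    # Total abundance ("size") is the sum of the merged sample counts.
    print(seq_id, sum(counts), counts)
```

Sample 101a had no reads of uniq100 itself, yet after the merge it carries the 13 reads of uniq1000 under the uniq100 id, which is exactly the case of a sample containing only the daughter.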

Happy to help! A.

SanniH commented 3 years ago

That's great! Thanks for the help, I look forward to seeing the how-to guide! Cheers, Sanni