Questions about multiple steps on the pipeline

ilevade commented 5 years ago

Hi Chris,

Thank you for your program!

I'm running DESMAN on multiple gut microbiome samples, trying to identify species other than E. coli, and I have several questions about the pipeline.

1/ At the step where you identify the core COGs: "Then we determine those regions of the contigs with core COGs on in single copy using the 982 predetermined E. coli core COGs: $DESMAN/scripts/SelectContigsPos.pl $DESMAN/complete_example/EColi_core_ident95.txt < ClusterEC.cogs > ClusterEC_core.cogs"

How do you generate the "EColi_core_ident95.txt" file? I probably missed a step, but I couldn't find the information in the protocol. In other DESMAN tutorials, I saw this file was called scgs.txt, but how to generate it wasn't described there either.

2/ The script Variant_filter.py generates 8 output files, but I'm not quite sure what they all are or what the different columns in each of them represent. Sorry if it's already described somewhere.

3/ When running DESMAN on real data, I wasn't sure how many strains and replicates to test, so I ran multiple tests and got these results:

when testing 8 strains with 10 replicates: 2,2,9,0.0,ClusterVC_2_9/Filtered_Tau_star.csv
10 strains with 20 replicates: 7,4,6,0.061764705882352944,ClusterVC_7_6/Filtered_Tau_star.csv
15 strains with 30 replicates: 8,4,14,0.0667892156862745,ClusterVC_8_14/Filtered_Tau_star.csv

I think I've read that more iterations are advisable on real data, but it seems that the error rate is higher in this case. Not sure what to think about it.

4/ For the validation of strains: I'm not sure I understood what the file Hits.tar.gz is and how to generate it. Also, when using the script validateSNP2.py, I couldn't find how to generate the file Collated_Tau_mean.csv.

Complete example: python3 $DESMAN/scripts/validateSNP2.py ../RunDesman/ClusterEC_6_2/Collated_Tau_mean.csv ClusterEC_core_tau_gene.csv

I looked in other DESMAN tutorials and found different examples, and I wasn't sure which one to follow. STAMP 2017 tutorial: python $DESMAN/scripts/validateSNP2.py Cluster16_2_3/Filtered_Tau_star.csv Cluster16_core_tauRF.csv

Ebame4: python $DESMAN/scripts/validateSNP2.py Cluster14_3_0/Filtered_Tau_star.csv Cluster14_3_0/Filtered_Tau_star.csv

Sorry for all the questions. Some members of my lab were at the ANVI'O workshop in the UK this summer but I couldn't attend; I would have preferred to ask all this in person :)

Thank you so much for your help!

Ines

kassammo commented 5 years ago

Hi Ines, I have the same questions because I am also working on gut metagenomics. Did you get any answers in the end?

Thanks,

Mohamed

mherold1 commented 4 years ago

1/ You could probably also use this list from (https://www.ncbi.nlm.nih.gov/pubmed/25218180?dopt=Abstract ):

COG0016
COG0060
COG0184
COG0049
COG0088
COG0092
COG0094
COG0197
COG0201
COG0532
COG0048
COG0052
COG0080
COG0081
COG0087
COG0090
COG0093
COG0096
COG0097
COG0103
COG0256
COG0051
COG0072
COG0089
COG0091
COG0100
COG0102
COG0185
COG0200
COG0244
COG0186
COG0198
COG0541
COG0552
COG0504
COG0130
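
To use the list above in place of EColi_core_ident95.txt, it first has to be saved to a file. Below is a minimal Python sketch that does this. It assumes the core COG file is plain text with one COG ID per line, which I have not checked against SelectContigsPos.pl; the function name write_cog_list and the output file name are made up for illustration.

```python
from pathlib import Path

# The 36 COG IDs from the list above, in the same order.
CORE_COGS = """
COG0016 COG0060 COG0184 COG0049 COG0088 COG0092 COG0094 COG0197 COG0201
COG0532 COG0048 COG0052 COG0080 COG0081 COG0087 COG0090 COG0093 COG0096
COG0097 COG0103 COG0256 COG0051 COG0072 COG0089 COG0091 COG0100 COG0102
COG0185 COG0200 COG0244 COG0186 COG0198 COG0541 COG0552 COG0504 COG0130
""".split()


def write_cog_list(path, cogs=CORE_COGS):
    """Write one COG ID per line (ASSUMED file format) and return the count.

    write_cog_list is a hypothetical helper, not part of DESMAN.
    """
    # Drop accidental duplicates while keeping the original order.
    unique = list(dict.fromkeys(cogs))
    Path(path).write_text("\n".join(unique) + "\n")
    return len(unique)
```

Calling write_cog_list("scgs_36.txt") would write the 36 IDs to scgs_36.txt. If the format assumption holds, that file could then be passed as the first argument to SelectContigsPos.pl instead of EColi_core_ident95.txt.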

2/ There is a more detailed description of the output files in the simple example for the variant filter:

    COG0015_outp_df.csv: This gives p-values for each position.

    COG0015_outq_df.csv: This gives q-values for each position.

    COG0015_outr_df.csv: This gives log-ratio statistics for each position.

    COG0015_outsel_var.csv: This is the file of selected variants.

    COG0015_outtran_df.csv: A matrix of estimated error rates.
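
As a quick way to check which of these files a given run produced, here is a small Python sketch. It only covers the five files described above (not all 8 mentioned in the question), and it assumes every file name is the output prefix followed directly by the suffix, as in the COG0015_out example. The helper name describe_outputs is hypothetical.

```python
from pathlib import Path

# Suffix -> description, copied from the list above.
OUTPUT_DESCRIPTIONS = {
    "p_df.csv": "p-values for each position",
    "q_df.csv": "q-values for each position",
    "r_df.csv": "log-ratio statistics for each position",
    "sel_var.csv": "selected variants",
    "tran_df.csv": "matrix of estimated error rates",
}


def describe_outputs(prefix, directory="."):
    """Return (file name, description, exists) for each described output.

    describe_outputs is a hypothetical helper, not part of DESMAN.
    ASSUMPTION: file name = prefix + suffix, e.g. COG0015_out + p_df.csv.
    """
    rows = []
    for suffix, text in OUTPUT_DESCRIPTIONS.items():
        name = f"{prefix}{suffix}"
        rows.append((name, text, (Path(directory) / name).is_file()))
    return rows
```

For example, describe_outputs("COG0015_out") returns one tuple per file, so a missing output shows up with False in the last position.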

chrisquince / DESMAN

Questions about multiple steps on the pipeline #37