MrOlm / inStrain

Bioinformatics program inStrain
MIT License

pNpS_variants value #188

Closed ucassee closed 3 months ago

ucassee commented 3 months ago

Hi developer,

  1. Why are the values of SNV_N_count, SNV_S_count and pNpS_variants in output/xxx.IS_gene_info.tsv not the same as those in raw_data/genes_SNP_count.csv?

  2. What do S_sites and N_sites mean, and why aren't they integers?

genes_SNP_count.csv:

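| | mm | gene | gene_length | divergent_site_count | SNS_count | SNS_N_count | SNS_S_count | SNV_count | SNV_N_count | SNV_S_count | S_sites | N_sites | dNdS_substitutions | pNpS_variants |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | contig1_1 | 765 | 7 | 0 | 0 | 0 | 7 | 3 | 4 | 161.6666667 | 603.3333333 | | 0.200966851 |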

IS_gene_info.tsv:

| scaffold | gene | gene_length | coverage | breadth | breadth_minCov | nucl_diversity | start | end | direction | partial | dNdS_substitutions | pNpS_variants | SNV_count | SNV_S_count | SNV_N_count | SNS_count | SNS_S_count | SNS_N_count | divergent_site_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| contig1 | contig1_1 | 765 | 642.5477124183006 | 1.0 | 1.0 | 0.0192196642651277 | 1 | 765 | -1 | False | | 0.2446552966610617 | 44 | 23 | 21 | 0 | 0 | 0 | 46 |

Thanks

MrOlm commented 3 months ago

Hello,

1) Do you have multiple lines for the same gene in the raw_data file for each mm? The mm thing is pretty weird and probably can explain this discrepancy: https://instrain.readthedocs.io/en/v1.3.0/Advanced_use.html#dealing-with-mm

2) Each base can be mutated to one of 3 bases, and some of those bases are "S" and some are "N" based on the codon table. S_sites and N_sites can thus end in 0.33 or 0.66. inStrain uses the standard method of calculating pN/pS.

Best, Matt
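To illustrate the point about fractional sites, here is a minimal Python sketch (not inStrain's own code). It assumes the standard genetic code, and it assumes a change to or from a stop codon counts as nonsynonymous; inStrain may treat these cases differently. The function names `count_sites` and `pn_ps` are hypothetical. Each base has three possible substitutions, so a base contributes 0, 1/3, 2/3 or 1 to S_sites, and pN/pS is then (SNV_N_count / N_sites) / (SNV_S_count / S_sites).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the organism uses this table; '*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_sites(cds):
    """Return (S_sites, N_sites) for an in-frame coding sequence.

    Every base has 3 possible substitutions; each synonymous one adds 1/3
    to S_sites and each nonsynonymous one adds 1/3 to N_sites.
    Codons containing characters other than T, C, A, G are skipped.
    """
    cds = cds.upper()
    s_sites = 0.0
    n_sites = 0.0
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon not in CODON_TABLE:
            continue
        synonymous = 0
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    synonymous += 1
        s_sites += synonymous / 3
        n_sites += (9 - synonymous) / 3
    return s_sites, n_sites


def pn_ps(n_count, s_count, n_sites, s_sites):
    """pN/pS = (N variants per N site) / (S variants per S site)."""
    if s_count == 0 or n_sites == 0 or s_sites == 0:
        return None  # ratio undefined
    return (n_count / n_sites) / (s_count / s_sites)


# TTT (Phe): only TTT -> TTC is synonymous, so S_sites = 1/3, N_sites = 8/3.
print(count_sites("TTT"))

# Counts and site totals taken from the two tables posted in this issue.
print(pn_ps(3, 4, 603.3333333, 161.6666667))    # about 0.201
print(pn_ps(21, 23, 603.3333333, 161.6666667))  # about 0.245
```

With the numbers posted above, both pNpS_variants values follow from the same S_sites and N_sites (which sum to the gene length of 765); only the SNV counts differ between the two files.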

ucassee commented 3 months ago

Hi Matt,

Thanks for your reply.

Yes, there are multiple lines for the same gene in the genes_SNP_count.csv file, one for each mismatch level.
Which of these lines does inStrain actually use to count SNPs for the final IS_gene_info.tsv file?

Yingli

MrOlm commented 3 months ago

That's a complicated question, as indicated here: https://instrain.readthedocs.io/en/v1.3.0/Advanced_use.html#dealing-with-mm. The raw data files aren't really meant for users to look at. Is there some reason you want to use that raw file instead of the IS_gene_info.tsv file?

Matt

ucassee commented 3 months ago

No, I just wanted to confirm that using the gene_info.tsv file is a suitable choice. I think it is okay.

Thanks