Questions about TSV input format

Nautilus96 commented 3 months ago

Hello there!

I am interesting on using StrainFacts, but for various reasons using a metagenotyper like GT-Pro does not work well for my data. My plan is to profile SNPS using GATK, and extract the SNPS from the resulting VCF files. This seems doable after looking at the TSV input format.

However, I have some questions regarding the position column. One of the genomes I am using as a reference for variant calling is unfortunately not closed (uncultured bacterium, so it's the best we have). In the resulting vcf file, variant positions are then reported first by contig number, followed by a numeric index indicating the position in said contig.

You mention in issue #5 that the position column is "a unique identifier indexing each of the genomic positions being metagenotyped". For my case, just using the numeric index of the VCF file would of course not work, but here are the two solutions I have considered:

Merging the contig name with numeric index fields, creating an alpha-numeric index (eg: Contig1_pos_1). Would this work for StrainFacts? Or does it expect a strictly numeric index?
If a strictly numeric index is required, I thought about re-indexing the position by "ignoring" the contig identity, so that once a variant in a new contig is encountered, the numeric index simply continues insted of "reseting". This would basically amount to assuming the genome is one single contig, and I am not sure what the consequences of this would be for StrainFacts.

Do you have any thoughts on this? I would welcome any recommendations!

Best, Sebastian

bsmith89 commented 3 months ago

Hi Sebastian, thank for reaching out!

I believe that either of your solutions should work, although there may be some unexpected bugs that pop up with the first. I personally always use unique integers for my SNP positions, so that's the most robust / well-tested of the two.

Because StrainFacts is agnostic to the differences between positions (e.g. it doesn't care about linear ordering), you can assign positions any way you like, as long as they're unique.

I think the approach I would take is to get a sorted list of all (contig,local-position) pairs, and then assign ascending integers to them:

contig-name      local-position   position
---              ---              ---
contig-1         1                1
contig-1         5                2
contig-2         2                3
contig-3         5                4

Where the third, "position" column is the one used by StrainFacts. Saving that mapping file will allow you to convert back into your original coordinates for downstream analyses.

I'll leave this issue open for at least a week or two. Please report back if you run into any errors with either approach.

Cheers; hope this helps!

Nautilus96 commented 3 months ago

Hello Byron,

Thank you for your reply! This will definitely help a lot. I will try it out and if I encounter any issues I will let you know.

I was wondering if I could get your help with something else though. Since model selection for StrainFacts seems to be somewhat complex, I was planning on comparing the results with those of StrainFinder, since it provides the AIC as straightforward model evaluation criterion. Yesterday I managed to run StrainFinder, but so far I have been unable to extract any information from the resulting EM object. Since you used StrainFinder for your paper on StrainFacts, would you be open to sharing any code or scripts that you used for the post-processing of the StrainFinder results? Unfortunately the documentation for StrainFinder is lacking and the GitHub page has been inactive for quite a while now, so any information is greatly appreciated!

Best, Sebastian

bsmith89 commented 3 months ago

Hi! Yes, I can point you to some of my resources.

All of my analysis code for the StrainFacts paper is here: https://github.com/bsmith89/strainfacts-manuscript/

To find the relevant bits I suggest starting with my Snakefile focused on StrainFinder benchmarking: https://github.com/bsmith89/StrainFacts-manuscript/blob/main/snake/sfinder.smk

This references a couple of important scripts:

Unfortunately, it's not like I've done a careful job of documenting those scripts for additional users. Hopefully the Snakefile will help demonstrate their usage.

Also check out the (potentially valuable) patch that I made to the StrainFinder codebase (but if you don't need to set the random seed, this won't matter): https://github.com/bsmith89/StrainFacts-manuscript/blob/main/include/StrainFinder.patch

It sounds like you have StrainFinder running, but you may also get something out of my conda environment: https://github.com/bsmith89/StrainFacts-manuscript/blob/main/conda/sfinder.yaml

Hope this helps!

bsmith89 / StrainFacts

Questions about TSV input format #15