Open Nautilus96 opened 3 months ago
Hi Sebastian, thank for reaching out!
I believe that either of your solutions should work, although there may be some unexpected bugs that pop up with the first. I personally always use unique integers for my SNP positions, so that's the most robust / well-tested of the two.
Because StrainFacts is agnostic to the differences between positions (e.g. it doesn't care about linear ordering), you can assign positions any way you like, as long as they're unique.
I think the approach I would take is to get a sorted list of all (contig,local-position) pairs, and then assign ascending integers to them:
contig-name local-position position
--- --- ---
contig-1 1 1
contig-1 5 2
contig-2 2 3
contig-3 5 4
Where the third, "position" column is the one used by StrainFacts. Saving that mapping file will allow you to convert back into your original coordinates for downstream analyses.
I'll leave this issue open for at least a week or two. Please report back if you run into any errors with either approach.
Cheers; hope this helps!
Hello Byron,
Thank you for your reply! This will definitely help a lot. I will try it out and if I encounter any issues I will let you know.
I was wondering if I could get your help with something else though. Since model selection for StrainFacts seems to be somewhat complex, I was planning on comparing the results with those of StrainFinder, since it provides the AIC as straightforward model evaluation criterion. Yesterday I managed to run StrainFinder, but so far I have been unable to extract any information from the resulting EM object. Since you used StrainFinder for your paper on StrainFacts, would you be open to sharing any code or scripts that you used for the post-processing of the StrainFinder results? Unfortunately the documentation for StrainFinder is lacking and the GitHub page has been inactive for quite a while now, so any information is greatly appreciated!
Best, Sebastian
Hi! Yes, I can point you to some of my resources.
All of my analysis code for the StrainFacts paper is here: https://github.com/bsmith89/strainfacts-manuscript/
To find the relevant bits I suggest starting with my Snakefile focused on StrainFinder benchmarking: https://github.com/bsmith89/StrainFacts-manuscript/blob/main/snake/sfinder.smk
This references a couple of important scripts:
Unfortunately, it's not like I've done a careful job of documenting those scripts for additional users. Hopefully the Snakefile will help demonstrate their usage.
Also check out the (potentially valuable) patch that I made to the StrainFinder codebase (but if you don't need to set the random seed, this won't matter): https://github.com/bsmith89/StrainFacts-manuscript/blob/main/include/StrainFinder.patch
It sounds like you have StrainFinder running, but you may also get something out of my conda environment: https://github.com/bsmith89/StrainFacts-manuscript/blob/main/conda/sfinder.yaml
Hope this helps!
Hello there!
I am interesting on using StrainFacts, but for various reasons using a metagenotyper like GT-Pro does not work well for my data. My plan is to profile SNPS using GATK, and extract the SNPS from the resulting VCF files. This seems doable after looking at the TSV input format.
However, I have some questions regarding the position column. One of the genomes I am using as a reference for variant calling is unfortunately not closed (uncultured bacterium, so it's the best we have). In the resulting vcf file, variant positions are then reported first by contig number, followed by a numeric index indicating the position in said contig.
You mention in issue #5 that the position column is "a unique identifier indexing each of the genomic positions being metagenotyped". For my case, just using the numeric index of the VCF file would of course not work, but here are the two solutions I have considered:
Do you have any thoughts on this? I would welcome any recommendations!
Best, Sebastian