alexcritschristoph / soil_popgen

Reproducible scripts and notebooks for 2019 paper on population genetics in metagenomes
GNU General Public License v3.0

install inStrain_lite #1

Open palomo11 opened 5 years ago

palomo11 commented 5 years ago

Hi,

I'm trying to use inStrain_lite with my own data. I have tried to install it, but I haven't managed to yet.

When I try $ pip3 install inStrain_lite, I get:

Collecting inStrain_lite
ERROR: Could not find a version that satisfies the requirement inStrain_lite (from versions: none)
ERROR: No matching distribution found for inStrain_lite

I then cloned the repository and tried again, but got the same error.

Any advice?

Thanks.

alexcritschristoph commented 5 years ago

Hi! If you cd into the directory ./inStrain_lite/ and run pip3 install . it will install inStrain_lite from that directory (it isn't in the pip repositories, apologies for the confusion!).

Just so you know, we are working on a more complete and robust version of this program that hopefully will be completed and released in 2019.

Thanks! Alex

palomo11 commented 5 years ago

Thanks for the answer. It worked.

I have started to run some of the scripts, but I have some questions:

In the "tutorial" it says:

"Briefly, a bowtie2 index is created using representative members for all 664 dereplicated genomes. Reads were then mapped to this index using bowtie2"

Does this mean that the 664 FASTA files (1 per dereplicated genome) are combined, and then an index is created from that combined file? And then each metagenome is mapped against that one index?

In the Meadow-wide population profiling section, it says: "A BAM file for each species containing all filtered reads assigned to that species from the meadow was created using combine_filter_bams.py"

If all dereplicated genomes were combined into one index, and then a BAM was created for each metagenome, how do you separate that BAM into per-species BAM files?

For the Meadow-wide population profiling, why is the option --min_breadth_cov 0.5,5 not used?

For the Per-sample population profiling, the BAM file used in the example is "../bams/all_14_0903_02_20cm_sorted.bam". I thought that here you would use a BAM file representing one metagenome mapped against one dereplicated genome. Am I wrong?

Thanks a lot in advance. The new version you are working on sounds good. But I'm in a bit of a hurry, so in the meantime I will continue with what you have made available so far. In any case, I will keep an eye out for any update or release.

alexcritschristoph commented 5 years ago

Hi! Apologies for the slow reply, I was on vacation.

> Does this mean that the 664 FASTA files (1 per dereplicated genome) are combined, and then an index is created from that combined file? And then each metagenome is mapped against that one index?

Yes, that is exactly right. Including every genome in the index at once greatly reduces the number of mismapped reads: if a read maps well to two genomes, it should not be used as evidence for a SNP in either of them, because we don't know which one it came from.
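For anyone following along, the "combine everything first" step can be sketched roughly as below. This is only an illustration, not a script from this repository: the function name combine_fastas and the file names are made up, and it assumes scaffold names are unique across genomes (it raises an error if they are not). The combined file would then be the input to bowtie2-build.

```python
from pathlib import Path


def combine_fastas(fasta_paths, combined_path):
    """Concatenate per-genome FASTA files into one combined FASTA.

    Returns a dict mapping each scaffold name to the file name of the
    genome it came from, so scaffolds can be traced back to a genome later.
    (Hypothetical helper for illustration; not part of inStrain_lite.)
    """
    scaffold_to_genome = {}
    with open(combined_path, "w") as out:
        for path in map(Path, fasta_paths):
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        if not fields:
                            raise ValueError(f"empty FASTA header in {path}")
                        # The scaffold name is the first word of the header.
                        name = fields[0]
                        if name in scaffold_to_genome:
                            raise ValueError(f"duplicate scaffold name: {name}")
                        scaffold_to_genome[name] = path.name
                    # Make sure every line ends with a newline so files
                    # without a trailing newline do not run together.
                    out.write(line if line.endswith("\n") else line + "\n")
    return scaffold_to_genome
```

Keeping the scaffold-to-genome mapping around is optional, but it makes it easy to check later which genome a given reference name in the BAM belongs to.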

> If all dereplicated genomes were combined into one index, and then a BAM was created for each metagenome, how do you separate that BAM into per-species BAM files?

I pass the combined BAM (mapped against all 664 genomes) along with a FASTA file for just the genome of interest each time I run the script. Your FASTA doesn't have to completely match your BAM; it just has to be a subset of it. So I run the script separately for each genome, with its own FASTA.
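The idea of "the FASTA is a subset of the BAM" can be pictured with a small sketch. This is not how the scripts in this repository are written; it is a pure-Python stand-in where each alignment is modelled as a (read_name, scaffold_name) tuple instead of a real BAM record, and both function names are hypothetical.

```python
def fasta_scaffold_names(fasta_text):
    """Return the set of scaffold names (first word of each header) in a FASTA string."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def reads_for_genome(alignments, fasta_text):
    """Keep only alignments whose reference scaffold appears in the genome's FASTA.

    `alignments` is an iterable of (read_name, scaffold_name) tuples standing
    in for records from the combined BAM (an assumption for illustration).
    """
    wanted = fasta_scaffold_names(fasta_text)
    return [(read, scaffold) for read, scaffold in alignments if scaffold in wanted]
```

For example, with alignments to scaffolds from three genomes and a FASTA containing only the scaffolds of one genome, only the reads on that genome's scaffolds are returned; everything else in the combined BAM is ignored for that run.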

> For the Meadow-wide population profiling, why is the option --min_breadth_cov 0.5,5 not used?

With the value 0.5,5, the option --min_breadth_cov will cause the program to only output the final files if at least 50% of the genome is at 5x coverage. In my case it saves a lot of space by eliminating files that would be low coverage. Because I know the meadow-wide population is going to be at much higher coverage than this, I don't need the option there.
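As a rough illustration of what a "0.5,5" threshold means, here is a minimal sketch. It is not the program's actual code: the function names are invented, the genome is represented as a plain list of per-position read depths, and "at 5x coverage" is assumed to mean a depth of 5 or more.

```python
def parse_min_breadth_cov(value):
    """Split a string such as '0.5,5' into (minimum breadth, minimum depth)."""
    breadth, depth = value.split(",")
    return float(breadth), int(depth)


def passes_breadth_filter(depths, min_breadth, min_depth):
    """Return True if enough of the genome is covered deeply enough.

    `depths` is a list with one read depth per genome position
    (an assumed representation for illustration).
    """
    if not depths:
        return False
    # Count positions that reach the minimum depth (assumed to be inclusive).
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths) >= min_breadth
```

Under these assumptions, a genome where exactly half of the positions have depth 5 or more would pass, and one where only a quarter do would be skipped and produce no output files.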

> For the Per-sample population profiling, the BAM file used in the example is "../bams/all_14_0903_02_20cm_sorted.bam". I thought that here you would use a BAM file representing one metagenome mapped against one dereplicated genome. Am I wrong?

Ah yeah, the BAM is always created from all of my genomes, and the FASTA is just the one genome. I think this is very important for reducing read mismapping between genomes.

Hopefully this is helpful for you!

palomo11 commented 5 years ago

Thanks a lot. It makes complete sense, especially creating the BAM with all genomes combined to avoid read mismapping between genomes.

I will start running the scripts, and I will let you know if I run into any issues or questions.

Thanks again!