danielpodlesny / samestr

SameStr identifies shared strains between pairs of metagenomic samples based on the similarity of SNV profiles.
GNU Affero General Public License v3.0

Metaphlan version #1

Closed ShaneMota closed 2 years ago

ShaneMota commented 2 years ago

Hello, I'm excited to use your tool on my data, but I'm running into issues with MetaPhlAn versions and was wondering whether you could help.

Here is my align command:

samestr align \
    --input-files data/*fastq.gz \
    --input-sequence-type paired \
    --kneaddata-exe /opt/software/miniconda3/envs/samestr/bin/kneaddata \
    --fastq-stats-exe /opt/software/miniconda3/envs/samestr/bin/fastq-stats \
    --host-bowtie2db /ref_dbs/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
    --metaphlan2-exe /opt/software/miniconda3/envs/metaphlan3/bin/metaphlan \
    --mpa /dataone/common/ref_dbs/metaphlan3/mpa_v30_CHOCOPhlAn_201901 \
    --mpa-pkl /dataone/common/ref_dbs/metaphlan3/mpa_v30_CHOCOPhlAn_201901.pkl \
    --nprocs 30 \
    --output-dir out_align/

It looks like MetaPhlAn3 no longer has the '--mpa-pkl' option, as I get the error below:

metaphlan: error: unrecognized arguments: --mpa_pkl /opt/metaphlan2/db_v20/mpa_v20_m200.pkl

I also tried a MetaPhlAn2 executable, but I think MetaPhlAn2 is no longer available: I can't find its Bitbucket page or its database. I also tried to install MetaPhlAn2 via conda, but I get an error at the download step: Downloading https://bitbucket.org/biobakery/metaphlan2/downloads/mpa_latest...

Best, Shane

danielpodlesny commented 2 years ago

Hi Shane,

Thanks for your message. Indeed, the align step is currently set up for MetaPhlAn2 (MP2). Since MetaPhlAn3 (MP3) switched to Python 3.x, supporting it will require some major adjustments. I will post more documentation to the repo soon.

Since align is just a wrapper around kneaddata and MetaPhlAn, in the meantime please follow their guidelines at https://github.com/biobakery/kneaddata and https://github.com/biobakery/MetaPhlAn to install and run their software, and generate the MP3 alignments on your own.

From your post the commands could look ~something like this, where $ID is the sample identifier:

Kneaddata:

kneaddata \
    -i ${ID}.R1.fastq.gz \
    -i ${ID}.R2.fastq.gz \
    -db /ref_dbs/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
    -p 2 \
    -t 15 \
    --max-memory $RAM \
    --output-prefix ${ID} \
    --cat-final-output \
    --remove-intermediate-output \
    -o data/ 

MetaPhlAn3:

metaphlan data/${ID}.fastq.gz \
    --bowtie2db /dataone/common/ref_dbs/metaphlan3/ \
    --input_type fastq \
    --nproc 30 \
    --legacy-output \
    -t rel_ab \
    --bowtie2out out_align/${ID}.mp.bowtie2out \
    --samout out_align/${ID}.mp.sam.bz2 \
    -o out_align/${ID}.mp.profile.txt

For MetaPhlAn3, make sure to include the --legacy-output and --samout flags, and convert the database format for python 2.X/3.X compatibility reasons. You can then proceed with SameStr's convert step using the MP3 alignments (.sam.bz2).
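
To apply the MetaPhlAn3 command above to many samples, a dry-run loop along these lines may help. This is only a sketch under a few assumptions: the raw reads are named data/${ID}.R1.fastq.gz, all outputs (including the bowtie2out file) go to out_align/, and the helper names sample_id and metaphlan_cmd are made up for illustration. It only prints the commands and does not run MetaPhlAn.

```shell
#!/bin/sh
# Dry-run sketch: print one MetaPhlAn3 command per sample.
# Assumes raw reads are named data/<ID>.R1.fastq.gz (an assumption, adjust as needed).
set -eu

# Hypothetical helper: derive the sample ID from an R1 file name.
sample_id() {
    basename "$1" .R1.fastq.gz
}

# Hypothetical helper: print the MetaPhlAn3 command for one sample ID.
# Flags and paths are copied from the command in the post above.
metaphlan_cmd() {
    echo "metaphlan data/${1}.fastq.gz" \
        "--bowtie2db /dataone/common/ref_dbs/metaphlan3/" \
        "--input_type fastq" \
        "--nproc 30" \
        "--legacy-output" \
        "-t rel_ab" \
        "--bowtie2out out_align/${1}.mp.bowtie2out" \
        "--samout out_align/${1}.mp.sam.bz2" \
        "-o out_align/${1}.mp.profile.txt"
}

for r1 in data/*.R1.fastq.gz; do
    [ -e "$r1" ] || continue   # skip the literal pattern if nothing matched
    metaphlan_cmd "$(sample_id "$r1")"
done
```

Once the printed commands look right for your file layout, they can be piped to sh to execute them, provided the paths contain no spaces.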

Hope this helps, Daniel

huyue87 commented 2 years ago

I moved this to a new issue

danielpodlesny commented 2 years ago

samestr align will be deprecated for future MetaPhlAn versions (>v2) due to incompatibilities between python versions. The README contains information on how to work around these changes to successfully run SameStr on MetaPhlAn v3 and higher.