danielpodlesny / samestr

SameStr identifies shared strains between pairs of metagenomic samples based on the similarity of SNV profiles.
GNU Affero General Public License v3.0
16 stars 3 forks source link

index out of range #7

Closed StickHu closed 1 year ago

StickHu commented 1 year ago

Amazing work for samestr!But I met a problem when using it. The metaphlan command are following.

metaphlan ${m}_bac.1.fastq.gz,${m}_bac.2.fastq.gz \ --nproc 10 --input_type fastq \ --legacy-output \ -o ${m}_mp.profile.txt \ -t rel_ab \ --bowtie2db $metaphlan_path \ --index mpa_vJan21_CHOCOPhlAnSGB_202103 \ --bowtie2out ${m}_mp.bowtie2out \ --samout ${m}_mp.sam.bz2

And I have done the command recommanded:

within python 3 environment

import pickle import bz2 mpa_pkl_file = 'mpa_vJan21_CHOCOPhlAnSGB_202103.pkl' mpa_pkl = pickle.load(bz2.BZ2File(mpa_pkl_file))

f = bz2.BZ2File(mpa_pkl_file.replace('.pkl', '.py2.pkl'), 'wb') pickle.dump(mpa_pkl, f, protocol = 0)

Then built the samestr_db: samestr db --mpa-pkl /db/mpa_vJan21_CHOCOPhlAnSGB_202103.py2.pkl
--mpa-markers /db/mpa_vJan21_CHOCOPhlAnSGB_202103_VSG.fna --output-dir samestr_db

Then do the samestr convert samestr convert --input-files sam_bz2/*_mp.sam.bz2 --marker-dir samestr_db/ --min-vcov 5 --nprocs 10 --output-dir out_convert/

But the index problem has occurred:

2022-12-03 17:54:34,367 | DEBUG | samestr.convert._bam2freq | bam2freq | 23 | Piling: /data/workdir/huwa/oral_DB/subgingival_plaque/PRJNA552294/samestr_result/out_convert/bam/SRR9684976_mp.bam parse /data/workdir/huwa/oral_DB/subgingival_plaque/PRJNA552294/samestr_result/out_convert/bam/SRR9684976_mp.bam Traceback (most recent call last): File "/data/workdir/huwa/software/miniconda3/envs/samestr/bin/kp2np.py", line 38, in contig = line[0] IndexError: list index out of range

Best Stick Hu

danielpodlesny commented 1 year ago

Hi @StickHu,

Thanks for trying out SameStr!

I think the problem might be with the database build. You used the virus db mpa_vJan21_CHOCOPhlAnSGB_202103_VSG.fna to build the samestr db but the metaphlan profiles generated when using the mpa_vJan21_CHOCOPhlAnSGB_202103 index also include marker hits from Bacteria, Archaea, and Eukarya found in mpa_vJan21_CHOCOPhlAnSGB_202103_SGB.fna.

The latter is omitted with your command. Hence, there is a mismatch between what is detected with MetaPhlAn (you didn't specify search for viruses) and what is queried with SameStr (you submitted only the virus db).

You could solve this in three ways:

Let me know if this solved it for you.

I will update the documentation to include instructions on MetaPhlAn4.

StickHu commented 1 year ago

Thanks for your help. And it works successfully now.