CSB5 / OPERA-MS

OPERA-MS - Hybrid Metagenomic Assembler

Suggestion for making GTDB database #91

Open Xinpeng021001 opened 2 months ago

Xinpeng021001 commented 2 months ago

Hi,

I followed the wiki to create the GTDB database and noticed what looks like an error in the final step:

find OPERA-MS-DB/ -type f -name '*.fna.gz' > OPERA-MS-DB/genomes_list.tx

It produces an empty list file, so I guess it should be:

find -L OPERA-MS-DB/ -type f -name '*.fna.gz' > OPERA-MS-DB/genomes_list.txt

Otherwise, the strain clustering step may raise an error and fail at that point.
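For anyone who wants to check this outside the shell, here is a minimal Python sketch of what `find -L ... -type f -name '*.fna.gz'` does. It assumes the empty list comes from the genome files being symlinks (without `-L`, `-type f` does not match a symlink), which is only a guess about this database layout. The helper names `list_genomes` and `write_genomes_list` are hypothetical and not part of OPERA-MS.

```python
import os


def list_genomes(db_dir, suffix=".fna.gz"):
    """Return sorted paths under db_dir ending in suffix, following symlinks."""
    paths = []
    # followlinks=True descends into symlinked directories, similar to `find -L`.
    for root, _dirs, files in os.walk(db_dir, followlinks=True):
        for name in files:
            path = os.path.join(root, name)
            # os.path.isfile() follows symlinks, so a link to a regular file counts.
            if name.endswith(suffix) and os.path.isfile(path):
                paths.append(path)
    return sorted(paths)


def write_genomes_list(db_dir, out_name="genomes_list.txt"):
    """Write one genome path per line to db_dir/out_name and return the paths."""
    paths = list_genomes(db_dir)
    with open(os.path.join(db_dir, out_name), "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

Calling `write_genomes_list("OPERA-MS-DB")` would then write `OPERA-MS-DB/genomes_list.txt`; an empty result suggests the files are not where the script expects them.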

Best,

jsgounot commented 2 months ago

Hi Xinpeng,

thanks for letting me know.

Regards, JS

Xinpeng021001 commented 2 months ago

I also forgot to mention: the multithreading function of the Python script raises an error when run with more than one thread. I fixed it manually and can send the fix later if needed.

Best Regards, Xinpeng

jsgounot commented 2 months ago

Oh, that's interesting, it works fine on my machine and others. I'm interested to see the error message (if you still have it) and the fix, thanks.

Xinpeng021001 commented 2 months ago

If it works for you, I guess it might be an issue with my environment or Python version. Let me post it here:

    python $WORK/final_course_project/glacier_algae/script/OPERA-MS/src_utils/make_operams_db_from_gtdb.py all_genomes.txt.gz all_taxonomy_r220.tsv.gz --outdir test --threads 16
    Read taxonomy file
    Check taxonomic information
    Read genome file
    Check concordance
    Define genome size and seq numbers. Number of threads: 16
    Traceback (most recent call last):
      File "/work/yinlab/xinpeng/final_course_project/glacier_algae/script/OPERA-MS/src_utils/make_operams_db_from_gtdb.py", line 148, in <module>
        main()
      File "/work/yinlab/xinpeng/final_course_project/glacier_algae/script/OPERA-MS/src_utils/make_operams_db_from_gtdb.py", line 145, in main
        process(args)
      File "/work/yinlab/xinpeng/final_course_project/glacier_algae/script/OPERA-MS/src_utils/make_operams_db_from_gtdb.py", line 60, in process
        seqinfos = multi_threads_seqinfos(fnames) if args.threads > 1 else single_thread_seqinfos(fnames)
      File "/work/yinlab/xinpeng/final_course_project/glacier_algae/script/OPERA-MS/src_utils/make_operams_db_from_gtdb.py", line 109, in multi_threads_seqinfos
        with concurrent.futures.ProcessPoolExecutor(max_workers=args.threads) as executor:
    NameError: name 'args' is not defined

Xinpeng021001 commented 2 months ago

The old code:

    def multi_threads_seqinfos(fnames):
        seqinfos = {}
        with concurrent.futures.ProcessPoolExecutor(max_workers=args.threads) as executor:
            if USED_TQDM:
                iterator = tqdm.tqdm(executor.map(fasta_info, fnames), total=len(fnames))
            else:
                iterator = executor.map(fasta_info, fnames)

            for seqres in iterator:
                seqinfos.update(seqres)

        return seqinfos

and the fixed:

    def multi_threads_seqinfos(fnames, threads):
        seqinfos = {}
        with concurrent.futures.ProcessPoolExecutor(max_workers=threads) as executor:
            if USED_TQDM:
                iterator = tqdm.tqdm(executor.map(fasta_info, fnames), total=len(fnames))
            else:
                iterator = executor.map(fasta_info, fnames)

            for seqres in iterator:
                seqinfos.update(seqres)

        return seqinfos
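
Since the fixed function takes a second parameter, the call in `process` (line 60 in the traceback) presumably has to pass `args.threads` as well. Below is a self-contained sketch of that pattern, not the actual script: `fasta_info` is a hypothetical stand-in that returns a dummy value, `process` takes `fnames` as an extra argument for illustration, and a `ThreadPoolExecutor` replaces the script's `ProcessPoolExecutor` only to keep the example easy to run anywhere.

```python
import argparse
import concurrent.futures


def fasta_info(fname):
    # Hypothetical stand-in: the real function reads a FASTA file.
    return {fname: len(fname)}


def single_thread_seqinfos(fnames):
    seqinfos = {}
    for fname in fnames:
        seqinfos.update(fasta_info(fname))
    return seqinfos


def multi_threads_seqinfos(fnames, threads):
    # The worker count arrives as a parameter, so no global `args` is needed.
    seqinfos = {}
    # Assumption: ThreadPoolExecutor used here; the script uses ProcessPoolExecutor.
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        for seqres in executor.map(fasta_info, fnames):
            seqinfos.update(seqres)
    return seqinfos


def process(args, fnames):
    # Call site updated to hand args.threads to the helper explicitly.
    if args.threads > 1:
        return multi_threads_seqinfos(fnames, args.threads)
    return single_thread_seqinfos(fnames)
```

With `args = argparse.Namespace(threads=16)`, `process(args, fnames)` takes the multi-worker path without touching any name outside its own scope.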
Xinpeng021001 commented 2 months ago

By the way, you forgot a "t" in the find command :)

find -L OPERA-MS-DB/ -type f -name '*.fna.gz' > OPERA-MS-DB/genomes_list.tx

find -L OPERA-MS-DB/ -type f -name '*.fna.gz' > OPERA-MS-DB/genomes_list.txt

jsgounot commented 2 months ago

This is weird, I should have caught this issue before. Thanks for letting me know.

Xinpeng021001 commented 2 months ago

my pleasure :)