arpcard / rgi

Resistance Gene Identifier (RGI). Software to predict resistomes from protein or nucleotide data, including metagenomics data, based on homology and SNP models.

[BUG] Repetitive Database Indexing on rgi bwt #165

Closed dcdanko-biotia closed 1 year ago

dcdanko-biotia commented 3 years ago

Describe the bug

When I run rgi bwt, a large number of files are created but the program seemingly never terminates. The output indicates that a database indexing step is being run over and over again, and each repetition of this step ends with a RuntimeError.

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Input: rgi bwt -o test_sample.rgi -1 foo.r1.fq.gz

Input file: foo.r1.fq.gz, a gzipped FASTQ file containing 10 reads simulated from nucleotide_fasta_protein_homolog_model.fasta using wgsim.

Unzipped, this file contains:

@gb|GQ343019|+|132-1023|ARO:3002999|CblA-1_228_691_0:0:0_1:0:0_0/1
ACTGGACAAGATGGATAAGCAAAGCATCAGTCTGGACAGCATTGTTTCCATAAAGGCATCCCAAATGCCG
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|GQ343019|+|132-1023|ARO:3002999|CblA-1_122_660_1:0:0_2:0:0_1/1
CCTTTGGTATAGCCGTATGGACAGACAAAGGAGACATGCTCCGGTATAACGACCATGTACACTTCCCCTT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|GQ343019|+|132-1023|ARO:3002999|CblA-1_189_742_1:0:0_3:0:0_2/1
TTTTCATAGCGTCGGGATTGCGGTCGGAAGAGCCGGTCTTGTGTGCTACCACGGTTTTGGCTGGCAACAT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|HQ845196|+|0-861|ARO:3001109|SHV-52_321_818_1:0:0_0:0:0_0/1
ACACCTTGCCGACGGCATGCCGGTCGGCGAACTCTGTGCCGCCGCCATTACCATGAGCGATAACAGCGCC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|HQ845196|+|0-861|ARO:3001109|SHV-52_231_724_2:0:0_2:0:0_1/1
GTCGCGGGTGGATGCCGGTGACGAACAGCTGGAGCGAAAGATCCACTATCGCCAGCATGATCTGGTGGAC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|HQ845196|+|0-861|ARO:3001109|SHV-52_177_738_1:0:0_3:0:0_2/1
GCCAAGCAGGGCGACAATCCCTCGCGCACCCCGTTCGGCAGCTCCGGTCTTCTCGGCGATAAACCAGCCC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|JX017365|+|244-1120|ARO:3001989|CTX-M-130_41_530_1:0:0_2:0:0_0/1
CGGCGTGCATTCCGCTGCTGCTGGGCAGCGGGCCGCTTTATGCGCAGACGAGTGCGGTGCAGCAAAAGCT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|JX017365|+|244-1120|ARO:3001989|CTX-M-130_359_865_1:0:0_2:0:0_1/1
TGACGCTGGCAGAACTGAGCTCGGCCGCGTTGCAGTACAGCGACAATACCGCCATGAACAAATTGATTGC
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|JX017365|+|244-1120|ARO:3001989|CTX-M-130_263_788_0:0:0_2:0:0_2/1
AAACGCAAAAGCAGCTGCTTAATCAGCCTGTCGAGATCAAGCCTGCCGATCTGGTTAACTACAATCCGAT
+
2222222222222222222222222222222222222222222222222222222222222222222222
@gb|JN967644|+|0-813|ARO:3002356|NDM-6_289_745_4:0:0_0:0:0_0/1
CAGACCGCCCAGAACGTCCACTGGATCCAGCAGGAGATCAACCTGCCGGTCGCGCTGGCGGTGGTGACTC
+
2222222222222222222222222222222222222222222222222222222222222222222222
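
To confirm that a test file like this really holds the expected number of reads, the records can be counted directly. The sketch below assumes standard four-line FASTQ records (header, sequence, `+`, qualities); `count_fastq_reads` is a hypothetical helper written for illustration and is not part of RGI.

```python
import gzip


def count_fastq_reads(path: str) -> int:
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        # Ignore completely blank lines (e.g. a trailing newline at EOF).
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(
            f"{path}: {n_lines} non-empty lines is not a multiple of 4; "
            "the file may be truncated or not plain four-line FASTQ"
        )
    return n_lines // 4
```

Calling `count_fastq_reads("foo.r1.fq.gz")` should return 10 if the file matches the listing above.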

Error log

# Indexing databases.
# Updating DBs
# Reading inputfile:    /Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/site-packages/app/_data/card_reference.fasta
# Added:    ARO:3002999|ID:2|Name:CblA-1|NCBI:GQ343019
<omitted ~10,000 similar lines>
# Added:    ARO:3005385|ID:4118|Name:KPC-57|NCBI:MT358626.1
# Templates key-value pairs:    1482114.
#
# Total time used for DB indexing: 1.36 s.
#
# Compressing templates
# Preparing compressed DB.
# Calculating relative indexes.
# Finalizing indexes.
# Dumping compressed DB
# Template database created.
#
# Total time used for DB compression: 1.53 s.
#
# Reading inputfile:    /Users/dcdanko/Dev/BiotiaDx/pipeline/src/docker/biotiadx_core/src/pybiotiadx/tests/data/foo.r1.fq.gz
# Phred scale:  0
#
# Query converted
#
# Collecting k-mer scores.
#
# Total time used for DB loading: 0.01 s.
#
# Finding k-mer ankers
# Query ankered
#
# Score collection done
#
# Sort, output and select k-mer alignments.
# Total time for sorting and outputting KMA alignment   0.00 s.
#
# Doing local assemblies of found templates, and output results
# Total time used for local assembly: 0.02 s.
#
# Closing files
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/bin/rgi", line 4, in <module>
    MainBase()
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/site-packages/app/MainBase.py", line 82, in __init__
    getattr(self, args.command)()
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/site-packages/app/MainBase.py", line 270, in bwt
    self.bwt_run(args)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/site-packages/app/MainBase.py", line 306, in bwt_run
    obj.run()
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/site-packages/app/BWT.py", line 1746, in run
    self.get_summary()
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/site-packages/app/BWT.py", line 1266, in get_summary
    with Pool(processes=self.threads) as p:
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
# Indexing databases.
# Updating DBs
# Reading inputfile:    /Users/dcdanko/miniconda3/envs/pybiotiadx/lib/python3.9/site-packages/app/_data/card_reference.fasta
# Added:    ARO:3002999|ID:2|Name:CblA-1|NCBI:GQ343019
<this repeats continuously>
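
For context on the repetition: the traceback passes through popen_spawn_posix.py, which suggests the "spawn" start method is in use (the default on macOS since Python 3.8). Under spawn, every worker process re-imports the main script, so top-level code such as the `MainBase()` call at line 4 of the `rgi` entry script appears to run again in each worker, which would explain the repeated indexing. Below is a minimal sketch of the idiom the error message recommends. The names `count_gc` and `summarize` are hypothetical and only illustrate the pattern; this is not RGI's actual code.

```python
import multiprocessing as mp


def count_gc(read: str) -> int:
    """Toy worker: count G and C bases in one read."""
    return sum(base in "GC" for base in read)


def summarize(reads, threads=2):
    """Map count_gc over reads using a pool of worker processes."""
    # Request "spawn" explicitly so the behaviour matches macOS defaults
    # on any platform; workers will re-import this module on startup.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=threads) as pool:
        return pool.map(count_gc, reads)


# Without this guard, each spawned worker would execute the code below
# again while importing the module, and multiprocessing would raise the
# RuntimeError about the bootstrapping phase.
if __name__ == "__main__":
    example_reads = ["ACTGGACAAG", "CCTTTGGTAT"]
    print(summarize(example_reads))
```

The key point is that everything that starts work (argument parsing, indexing, creating the pool) sits under the `if __name__ == "__main__":` guard, while the worker function stays at module level so spawned processes can import it.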

CARD Version

$ rgi database -v
3.1.3

I installed this database using rgi auto_load on 2021-09-14

RGI version Resistance Gene Identifier - 5.2.1

Expected behavior

I would expect rgi to complete quickly, since the input sample has only 10 reads.

Additional context

$ pip freeze
biopython==1.78
numpy @ file:///Users/runner/miniforge3/conda-bld/numpy_1626648418679/work
pysam==0.16.0.1
RGI @ git+https://github.com/arpcard/rgi.git@040a781888e6f495bc15683f2194bbe55096e6e2
rsa==4.7.2
scipy==1.7.1
<some libraries omitted>

raphenya commented 3 years ago

@dcdanko-biotia I ran the following on macOS and did not get any of the errors you encountered:

# downloaded latest commit from https://github.com/arpcard/rgi
# unzip rgi-master.zip file
# load the database and setup in a local directory
python3 ./rgi-master/rgi auto_load --local --debug > load.log 2>&1
# run rgi bwt on the reads provided in this issue
python3 ./rgi-master/rgi bwt -o test_sample.rgi -1 foo.r1.fq.gz --debug --local > run.log 2>&1

See the attached logs: load.log and run.log.

Results: test_sample.rgi.gene_mapping_data.txt

Can you provide, step by step, the commands you ran?

raphenya commented 1 year ago

@dcdanko-biotia please re-open the issue if needed. Cheers.