antigenomics / vdjdb-db

🗂️ [vdjdb.cdr3.net is up and running] Git-based TCR database storage & management. Submissions welcome!
https://vdjdb.cdr3.net

Error when building most recent database #330

Closed bpkwee closed 1 year ago

bpkwee commented 2 years ago

Hi,

Thanks for generating this dataset, it is a great resource.

I previously cloned the database directory from vdjdb-web, but I would like to use the most recent database for a research project. However, when attempting to build the database I ran into some errors.

I cloned this repository and installed the most recent version of Groovy using brew and Biopython using pip (the latter is not mentioned in README.md). When I run the provided command groovy -cp . BuildDatabase.groovy from the src directory, the build prints the following warning and then fails: sys:1: DtypeWarning: Columns (20,29,30) have mixed types.Specify dtype option on import or set low_memory=False.

Context:

Fixing CDR3 sequences (stage II)
(it may take a while...)
sys:1: DtypeWarning: Columns (20,29,30) have mixed types.Specify dtype option on import or set low_memory=False.
-- processing the vdjdb_full.txt table
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/miniconda3/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Volumes/data_storag/vdjdb/vdjdb-db/src/AlignBestSegments.py", line 132, in fix_json
    for _, seg_row in segments[(segments.species == species) & (segments.gene == seg_gene_type) & (segments.segment == "Variable")].iterrows():
NameError: name 'segments' is not defined
"""
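
The DtypeWarning in the log above is emitted by pandas when a column parsed in chunks ends up with different inferred types. As a minimal sketch of the two options the warning itself names (the small in-memory table and its column names are hypothetical stand-ins, not the real vdjdb_full.txt layout):

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-separated table in which one column
# mixes numbers and text, as the warning reports for columns 20, 29 and 30.
rows = ["id\tscore"] + ["%d\t%d" % (i, i) for i in range(5)] + ["5\tNA_custom"]
table = io.StringIO("\n".join(rows))

# Option 1: low_memory=False makes pandas parse each column in one pass
# instead of chunk by chunk, so the column gets a single inferred dtype.
df = pd.read_csv(table, sep="\t", low_memory=False)

# Option 2: declare the dtype up front so no inference happens for that column.
table.seek(0)
df_typed = pd.read_csv(table, sep="\t", dtype={"score": str})

print(df_typed["score"].tolist())
```

Either option only silences the warning; it does not affect the NameError that follows in the same log.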

After adding low_memory=False to the read_csv() calls in AlignBestSegments.py (lines 265, 274 and 294), the DtypeWarning is gone, but the NameError persists:

Fixing CDR3 sequences (stage II)
(it may take a while...)
-- processing the vdjdb_full.txt table
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/miniconda3/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Volumes/data_storag/vdjdb/vdjdb-db/src/AlignBestSegments.py", line 132, in fix_json
    for _, seg_row in segments[(segments.species == species) & (segments.gene == seg_gene_type) & (segments.segment == "Variable")].iterrows():
NameError: name 'segments' is not defined
"""
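
The NameError above is raised inside a multiprocessing pool worker, not in the parent process. One possible cause, which this thread does not confirm, is the process start method: on macOS with Python 3.8 or later the default is "spawn", where each worker re-imports the script and never sees names assigned under an `if __name__ == "__main__":` block. A minimal sketch of a pattern that avoids relying on inherited globals follows; all names in it (`SEGMENT_ROWS`, `init_worker`, `variable_segments`, `run_in_pool`) are hypothetical and not taken from AlignBestSegments.py, and a plain list stands in for the pandas table.

```python
import multiprocessing as mp

# Hypothetical stand-in for the segments table; the real script filters
# a pandas DataFrame by species, gene and segment.
SEGMENT_ROWS = [
    {"species": "HomoSapiens", "gene": "TRB", "segment": "Variable", "id": "TRBV1"},
    {"species": "HomoSapiens", "gene": "TRB", "segment": "Joining", "id": "TRBJ1"},
    {"species": "MusMusculus", "gene": "TRA", "segment": "Variable", "id": "TRAV1"},
]

_segments = None  # set once per worker process by init_worker


def init_worker(segment_rows):
    """Runs once in every worker and stores the shared table there."""
    global _segments
    _segments = segment_rows


def variable_segments(query):
    """Worker function: ids of Variable segments for a (species, gene) pair."""
    species, gene = query
    return [s["id"] for s in _segments
            if s["species"] == species and s["gene"] == gene
            and s["segment"] == "Variable"]


def run_in_pool(queries, segment_rows, processes=2):
    # The table is passed to the workers explicitly through initargs, so the
    # lookup works whether workers are forked or spawned.
    with mp.Pool(processes, initializer=init_worker,
                 initargs=(segment_rows,)) as pool:
        return pool.map(variable_segments, queries)


if __name__ == "__main__":
    print(run_in_pool([("HomoSapiens", "TRB"), ("MusMusculus", "TRA")],
                      SEGMENT_ROWS))
```

If this is indeed the cause, the same idea would apply to the real script by loading or passing the segments table inside a pool initializer rather than only in the main block.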

Can you help me to solve this error?

Best wishes,

mariusmessemaker commented 2 years ago

Hi @mikessh, does the tag added above mean that you are working on this problem? If so, do you have an estimated date for when it will be fixed? Thank you, Marius

mikessh commented 2 years ago

Yes, we are making a draft Docker image. Note that it also includes the (as yet unoptimized) vdjdb-motif step, which requires ~64 GB of RAM to run. When we release it, perhaps you can modify the image to remove this step and fit it to your needs.

Right now you can check out https://github.com/antigenomics/vdjdb-db/pull/336

qmffkem commented 2 years ago

Hello, I encountered this error as well when I tried to build the most recent version. Is there any update on this issue? Thank you!