TobiasHeOl / kasearch

KA-Search: Rapid and exhaustive sequence identity search of known antibodies
BSD 3-Clause "New" or "Revised" License

Error when running EasySearch: only results for "Identity" column #9

Open lauratwomey opened 1 week ago

lauratwomey commented 1 week ago

Dear kasearch team,

First of all, thanks for all your work, kasearch is really promising!! I'm really hoping I can get it running soon.

I'm trying to run EasySearch on the sample sequence. I downloaded the publication dataset into this folder: /researchers/laura.twomey/Tools/omics_tools/kasearch/oasdb_20230111/

from kasearch import EasySearch
# Run ka search
results = EasySearch('QVKLQESGAELARPGASVKLSCKASGYTFTNYWMQWVKQRPGQGLDWIGAIYPGDGNTRYTHKFKGKATLTADKSSSTAYMQLSSLASEDSGVYYCARGEGNYAWFAYWGQGTTVTVSS',
    allowed_chain='Heavy',  
    allowed_species='Human', 
    regions=['whole'],  
    length_matched=[False], 
    database_path='/researchers/laura.twomey/Tools/omics_tools/kasearch/oasdb_20230111/'
)

But get this error:

Traceback (most recent call last):
  File "/home/ltwomey/src/Analysis/scRNAseq/run_kasearch.py", line 15, in <module>
    results = EasySearch(
              ^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/easy_search.py", line 56, in EasySearch
    return targetdb.get_meta(n_query=0, n_region=0, n_sequences='all', n_jobs=n_jobs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/kasearch.py", line 150, in get_meta
    metadf = self._extract_meta(self.current_best_ids[n_query, :n_sequences, n_region], n_jobs=n_jobs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/meta_extract.py", line 87, in _extract_meta
    fetched_metadata = pd.concat(Parallel(n_jobs=n_jobs)(delayed(self._get_single_study_meta)(group) for group in groups))
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/joblib/parallel.py", line 1918, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/joblib/parallel.py", line 1847, in _get_sequential_output
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/users/ltwomey/condaenvs/kasearch_new/lib/python3.12/site-packages/kasearch/meta_extract.py", line 47, in _get_single_study_meta
    study_file = self.id_to_study[study_id]
                 ~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: np.int64(3595)

I'm using:

lauratwomey commented 6 days ago

Update! I figured out that the error above occurred because I had removed the "Bender et al" lines from the id_to_study.txt file. When I use the original id_to_study.txt file from the 2023 OAS-aligned (63GB), kasearch runs but outputs a mostly empty dataframe (see below): only 8 lines have Identity values, and everything else is empty. I am unsure whether this is because Bender et al was removed from OAS or because I am not using EasySearch correctly. Any help would be greatly appreciated!

Could you let me know how to get the latest pre-aligned version of OAS?

I am running the command from the issue above:

Analysis starting at: 2024-07-05 14:57:16.627652
Running Easy Search...................................................

Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Heavy chain data in Bender et al. 2020 has been removed from OAS due to contamination.
Finished Easy Search...................................................

Saving results...................................................

  Unnamed: 0 sequence locus  ... Total sequences Isotype  Identity
0        NaN      NaN   NaN  ...             NaN     NaN  0.899160
1        NaN      NaN   NaN  ...             NaN     NaN  0.899160
2        NaN      NaN   NaN  ...             NaN     NaN  0.892562
3        NaN      NaN   NaN  ...             NaN     NaN  0.892562
4        NaN      NaN   NaN  ...             NaN     NaN  0.890756

[5 rows x 114 columns]
Analysis finished at: 2024-07-05 15:30:56.135003

TobiasHeOl commented 2 days ago

Hi Laura, thank you for using KA-Search and highlighting this issue!

Some time ago we decided to remove parts of the Bender 2020 study from OAS because we suspect that some of its human data contains mouse sequences. Because this removal would otherwise break the public pre-processed OAS for KA-Search, we updated the kasearch code to flag when a user query matches Bender 2020 sequences. Such matches are returned without metadata, as the metadata is no longer in OAS. Unfortunately, the example sequence we had provided is one that matches Bender 2020 sequences; this has now been changed (#10).
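
To illustrate what such results look like in practice, here is a minimal pandas sketch that separates matches with metadata from matches without it. It assumes the object returned by EasySearch is a pandas DataFrame with an "Identity" column, as in the output posted above. The helper name split_by_metadata and the toy values are hypothetical and not part of kasearch.

```python
import numpy as np
import pandas as pd


def split_by_metadata(results, identity_col="Identity"):
    """Split search results into rows with and without metadata.

    A row counts as lacking metadata when every column other than
    the identity column is NaN. (Hypothetical helper, not part of kasearch.)
    """
    meta_cols = [c for c in results.columns if c != identity_col]
    missing = results[meta_cols].isna().all(axis=1)
    return results[~missing], results[missing]


# Toy data (made-up values) mimicking the shape of the output in this thread.
toy = pd.DataFrame({
    "sequence": ["EVQLVESGG", np.nan, np.nan],
    "Species": ["human", np.nan, np.nan],
    "Identity": [0.91, 0.899160, 0.892562],
})

with_meta, without_meta = split_by_metadata(toy)
print(len(with_meta), "rows with metadata")
print(len(without_meta), "rows without metadata")
```

Rows that end up in the second frame would correspond to matches whose metadata is no longer available, so only their identity values can be used.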

If you need a current database, you can create your own pre-aligned version of OAS using the prepareOASdb.ipynb notebook. This takes some time and resources (~1 day on 20 CPUs), but you will then have an up-to-date pre-aligned version of OAS.

I hope this helps, otherwise please let me know if you have any other issues.