Rappsilber-Laboratory / AlphaLink2

AlphaLink2: Integrating crosslinking MS data into Uni-Fold-Multimer
Creative Commons Attribution 4.0 International

Features pkl has a different name #4

Closed sami-chaaban closed 1 year ago

sami-chaaban commented 1 year ago

My run seems to fail due to the feature pkl file not existing, see error below. It is looking for A.feature.pkl.gz in test-hex7/XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences but the only such file that I can find is test-hex7/XI/VIEW.feature.pkl.gz. Does the filename for the fasta file dictate the feature filename?

I0629 22:59:16.665984 23117938936704 utils.py:36] Started Jackhmmer (uniprot.fasta) query
I0629 23:07:53.910510 23117938936704 utils.py:40] Finished Jackhmmer (uniprot.fasta) query in 517.244 seconds
I0629 23:08:16.925292 23117938936704 homo_search.py:203] Final timings for XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_A: {'features': 2645.1914842128754, 'all_seq_features': 539.0654988288879}
I0629 23:08:16.934574 23117938936704 homo_search.py:161] searching homogeneous Sequences & structures for XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_B...
I0629 23:08:16.935142 23117938936704 homo_search.py:203] Final timings for XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_B: {}
I0629 23:08:16.939137 23117938936704 homo_search.py:161] searching homogeneous Sequences & structures for XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_C...
I0629 23:08:16.939221 23117938936704 homo_search.py:203] Final timings for XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_C: {}
I0629 23:08:16.970316 23117938936704 homo_search.py:161] searching homogeneous Sequences & structures for XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_D...
I0629 23:08:16.970479 23117938936704 homo_search.py:203] Final timings for XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_D: {}
Traceback (most recent call last):
  File "/cephfs2/public/AlphaLink2/AlphaLink2/inference.py", line 363, in <module>
    main(args)
  File "/cephfs2/public/AlphaLink2/AlphaLink2/inference.py", line 141, in main
    batch = load_feature_for_one_target(
  File "/cephfs2/public/AlphaLink2/AlphaLink2/inference.py", line 74, in load_feature_for_one_target
    batch, _ = load_and_process(
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/dataset.py", line 300, in load_and_process
    features, labels = load(load_kwargs, mode=mode, is_monomer=is_monomer)
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/dataset.py", line 202, in load
    all_chain_features = [
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/dataset.py", line 203, in <listcomp>
    load_single_feature(s, monomer_feature_dir, mode, uniprot_msa_dir, is_monomer)
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/data/utils.py", line 33, in wrapper
    return copy_lib.copy(cached_func(*args, **kwargs))
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/dataset.py", line 78, in load_single_feature
    monomer_feature = utils.load_pickle(
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/data/utils.py", line 33, in wrapper
    return copy_lib.copy(cached_func(*args, **kwargs))
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/data/utils.py", line 67, in load_pickle
    ret = load(path)
  File "/cephfs2/public/AlphaLink2/AlphaLink2/unifold/data/utils.py", line 64, in load
    with open_fn(path, "rb") as f:
  File "/cephfs2/public/AlphaLink2/alphalinkconda/envs/alphalink/lib/python3.9/gzip.py", line 58, in open
    binary_file = GzipFile(filename, gz_mode, compresslevel)
  File "/cephfs2/public/AlphaLink2/alphalinkconda/envs/alphalink/lib/python3.9/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'test-hex7/XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences/A.feature.pkl.gz'

lhatsk commented 1 year ago

Strange! No, the feature names are determined by the chains, but the folder (within the output folder) is named after the FASTA file. This is a relic of the original Uni-Fold pipeline. I should probably change this to use the output folder directly, which makes more sense.

What does your chains.txt look like?
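To make the layout described above concrete, here is a minimal sketch of where the loader appears to look for each chain's feature file. The helper names and the FASTA file name in the usage note are hypothetical. Only the `<output folder>/<FASTA name>/<chain>.feature.pkl.gz` layout is taken from the error message in this thread, and the actual AlphaLink2 code may build the path differently.

```python
from pathlib import Path

# Suffix taken from the FileNotFoundError above.
FEATURE_SUFFIX = ".feature.pkl.gz"


def expected_feature_path(output_dir, fasta_path, chain_id):
    """Return the path assumed to hold one chain's features.

    Assumed layout: <output_dir>/<FASTA file stem>/<chain_id>.feature.pkl.gz
    """
    return Path(output_dir) / Path(fasta_path).stem / f"{chain_id}{FEATURE_SUFFIX}"


def missing_features(output_dir, fasta_path, chain_ids):
    """List the expected feature files that do not exist on disk."""
    missing = []
    for chain_id in chain_ids:
        path = expected_feature_path(output_dir, fasta_path, chain_id)
        if not path.exists():
            missing.append(path)
    return missing
```

For example, `missing_features("test-hex7", "XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences.fasta", ["A", "B", "C", "D"])` would list the files the loader could not open, assuming the FASTA file carries that (hypothetical) name.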

sami-chaaban commented 1 year ago

> What does your chains.txt look like?

A B C D

lhatsk commented 1 year ago

I think I see what the issue is. Uni-Fold expects zero or one underscores in the sequence identifier and uses the part after the first underscore as the chain ID. I am not sure I have the time to fix it today. A quick solution should be to rename the chains in the FASTA file; the easiest would be just A, B, C, D. My guess is that your feature files were overwritten. Unless it's a homomer, you would need to re-run the MSA pipeline. Sorry for the trouble.

sami-chaaban commented 1 year ago

No worries! I'll give this a shot and report back.

lhatsk commented 1 year ago

Hm, it actually works fine in my small test. Could you attach the FASTA?

lhatsk commented 1 year ago

So it looks like it's not the chain naming, but the naming of the FASTA file. The file name shouldn't include underscores unless the underscore is the divider for the chain.
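The symptom reported at the top of this issue (features written to `XI/VIEW.feature.pkl.gz`) is consistent with an identifier such as `XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_A` being split on every underscore. The sketch below reproduces that behaviour under this assumption. `naive_split` and `split_on_last_underscore` are hypothetical helpers, not the Uni-Fold or AlphaLink2 code, and the second one is only one possible workaround, not necessarily the fix that was pushed.

```python
def naive_split(sequence_id):
    """Assumed failure mode: first piece is the folder, second piece is the chain."""
    parts = sequence_id.split("_")
    folder = parts[0]
    chain = parts[1] if len(parts) > 1 else None
    return folder, chain


def split_on_last_underscore(sequence_id):
    """Possible alternative: treat only the last underscore as the chain divider."""
    folder, sep, chain = sequence_id.rpartition("_")
    if not sep:
        # No underscore at all: the whole identifier is the name, no chain part.
        return sequence_id, None
    return folder, chain


seq_id = "XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences_A"
print(naive_split(seq_id))               # ('XI', 'VIEW')
print(split_on_last_underscore(seq_id))  # ('XI_VIEW_SY_XL_5_DSSO_XL_redo_sequences', 'A')
```

With a FASTA file name that contains no underscores, both functions agree, which matches the advice above.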

sami-chaaban commented 1 year ago

OK, sounds good. Thanks again for testing this. I'll try it when I have access to the files in a bit.

lhatsk commented 1 year ago

I pushed a fix. Let me know if it solves your issue.

sami-chaaban commented 1 year ago

Fixed, thanks!