Modelangelo v1.0.1 failing if the fasta contains unknown residues "X"

sroet commented 1 year ago

Hey,

I have a fasta file where certain residues are unknown and therefore represented with X, such as this one (one at the start and one at position 105):

>chain 'CM'
XPFKRFVEIGRVALVNYGKDYGRLVVIVDVVDQNRALVDAPDMVRCQINFKRLSLTDIKIDIKRVPKKTTLIKAMEEADVKNKWENSSWGKKLIVQKRRASLNDXDRFKVMLAKIKRGGAIRQELAKLKKTAAA

When trying to build against these fasta sequences you get an internal assertion error:


```
2023-06-15 at 17:54:09 | INFO | ModelAngelo with args: {'volume_path': '../sharpened.mrc', 'protein_fasta': '../fasta_files/proteins.fa', 'rna_fasta': '../fasta_files/rna.fa', 'dna_fasta': None, 'output_dir': '20230615_fasta', 'mask_path': None, 'device': '0', 'config_path': None, 'model_bundle_name': 'nucleotides', 'model_bundle_path': None, 'keep_intermediate_results': False, 'pipeline_control': False, 'func': }
2023-06-15 at 17:54:09 | INFO | Initial C-alpha prediction with args: {'model_checkpoint': 'chkpt.torch', 'bfactor': 0, 'batch_size': 4, 'box_size': 64, 'stride': 16, 'dont_mask_input': True, 'threshold': 0.05, 'save_real_coordinates': False, 'save_cryo_em_grid': False, 'do_nucleotides': True, 'save_backbone_trace': False, 'save_output_grid': False, 'crop': 6, 'log_dir': '/data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha', 'map_path': '../sharpened.mrc', 'output_path': '20230615_fasta/see_alpha_output', 'mask_path': None, 'device': '0', 'auto_mask': False}
2023-06-15 at 17:54:09 | INFO | Using model file /data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha/model.py
2023-06-15 at 17:54:09 | INFO | Using checkpoint file /data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha/chkpt.torch
2023-06-15 at 17:54:10 | INFO | Input structure has shape: (162, 162, 162)
2023-06-15 at 17:54:10 | INFO | Running with these arguments:
2023-06-15 at 17:54:10 | INFO | {'model_checkpoint': 'chkpt.torch', 'bfactor': 0, 'batch_size': 4, 'box_size': 64, 'stride': 16, 'dont_mask_input': True, 'threshold': 0.05, 'save_real_coordinates': False, 'save_cryo_em_grid': False, 'do_nucleotides': True, 'save_backbone_trace': False, 'save_output_grid': False, 'crop': 6, 'log_dir': '/data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha', 'map_path': '../sharpened.mrc', 'output_path': '20230615_fasta/see_alpha_output', 'mask_path': None, 'device': '0', 'auto_mask': False}
2023-06-15 at 18:01:55 | INFO | Model prediction done, took 465.11 seconds for 343 sliding windows
2023-06-15 at 18:01:55 | INFO | Average time is 1356.012 ms
2023-06-15 at 18:01:55 | INFO | Starting Cα grid to points...
2023-06-15 at 18:01:56 | INFO | Have 17015 Cα points before pruning and 7629 after pruning
2023-06-15 at 18:01:57 | INFO | Starting P grid to points...
2023-06-15 at 18:01:58 | INFO | Have 10785 P points before pruning and 4260 after pruning
2023-06-15 at 18:01:59 | INFO | Finished inference!
2023-06-15 at 18:01:59 | INFO | GNN model refinement round 1 with args: {'num_rounds': 3, 'crop_length': 200, 'repeat_per_residue': 1, 'esm_model': 'esm1b_t33_650M_UR50S', 'aggressive_pruning': True, 'seq_attention_batch_size': 200, 'fp16': False, 'batch_size': 1, 'voxel_size': 1.0, 'map': '../sharpened.mrc', 'protein_fasta': '../fasta_files/proteins.fa', 'rna_fasta': '../fasta_files/rna.fa', 'dna_fasta': None, 'struct': '20230615_fasta/see_alpha_output/see_alpha_merged_output.cif', 'output_dir': '20230615_fasta/gnn_output_round_1', 'model_dir': '/data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/gnn', 'device': '0', 'write_hmm_profiles': False, 'refine': False}
2023-06-15 at 18:01:59 | INFO | Loaded module from step: 483863
2023-06-15 at 18:02:49 | ERROR | Error in ModelAngelo
Traceback (most recent call last):

  File "/opt/apps/miniconda3/envs/model_angelo/bin/model_angelo", line 33, in
    sys.exit(load_entry_point('model-angelo==1.0.1', 'console_scripts', 'model_angelo')())
    │ │ └
    │ └
    └
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/__main__.py", line 52, in main
    args.func(args)
    │ │ └ Namespace(volume_path='../sharpened.mrc', protein_fasta='../fasta_files/proteins.fa', rna_fasta='../fasta_files/rna.fa', dna_...
    │ └
    └ Namespace(volume_path='../sharpened.mrc', protein_fasta='../fasta_files/proteins.fa', rna_fasta='../fasta_files/rna.fa', dna_...
> File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/apps/build.py", line 241, in main
    gnn_output = gnn_infer(gnn_infer_args)
    │ └ {'num_rounds': 3, 'crop_length': 200, 'repeat_per_residue': 1, 'esm_model': 'esm1b_t33_650M_UR50S', 'aggressive_pruning': Tru...
    └
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/gnn/inference.py", line 92, in infer
    protein = get_lm_embeddings_for_protein(lang_model, batch_converter, protein)
    │ │ │ └ Protein(atom_positions=None, atomc_positions=None, aatype=None, atom_mask=None, atomc_mask=None, residue_index=None, chain_in...
    │ │ └
    │ └ ProteinBertModel(
    │ (embed_tokens): Embedding(33, 1280, padding_idx=1)
    │ (layers): ModuleList(
    │ (0): TransformerLayer(
    │ ...
    └
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/data/generate_complete_prot_files.py", line 34, in get_lm_embeddings_for_protein
    protein_with_lm = add_lm_embeddings_to_protein(protein, lm_embeddings)
    │ │ └ array([[ 8.0417655e-04, 3.0484083e-01, 6.0511094e-01, ...,
    │ │ -2.1142796e-01, -2.8297421e-01, -9.1318183e-02],
    │ │ ...
    │ └ Protein(atom_positions=None, atomc_positions=None, aatype=None, atom_mask=None, atomc_mask=None, residue_index=None, chain_in...
    └
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/utils/protein.py", line 897, in add_lm_embeddings_to_protein
    assert len(lm_embeddings) == input_protein.unified_seq_len
    │ │ └ 5488
    │ └ Protein(atom_positions=None, atomc_positions=None, aatype=None, atom_mask=None, atomc_mask=None, residue_index=None, chain_in...
    └ array([[ 8.0417655e-04, 3.0484083e-01, 6.0511094e-01, ...,
    -2.1142796e-01, -2.8297421e-01, -9.1318183e-02],
    ...

AssertionError: assert len(lm_embeddings) == input_protein.unified_seq_len
```

If I (just) remove the "X" from the fasta sequence it seems to at least build a model without issue (still have to check if it is reasonable for my complete complex).

What would be the best way of dealing with these unknown residues, just delete them, replace them with glycines, or something else? Also, it would probably be nice to catch this issue before the start of the C-alpha prediction
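A pre-check of the kind requested here could be run on the fasta file before starting the build. The following is only a minimal sketch and not part of ModelAngelo: `parse_fasta`, `find_nonstandard`, and the `STANDARD_AA` set are hypothetical names, and it assumes that anything outside the 20 standard one-letter amino acid codes should be reported.

```python
# Hypothetical pre-check, not part of ModelAngelo: report residues outside the
# 20 standard amino acids before starting a long build.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def find_nonstandard(text):
    """Map each header to a list of (1-based position, character) problems."""
    problems = {}
    for header, seq in parse_fasta(text):
        bad = [(i, c) for i, c in enumerate(seq.upper(), start=1)
               if c not in STANDARD_AA]
        if bad:
            problems[header] = bad
    return problems


if __name__ == "__main__":
    # Shortened, made-up example with X at positions 1 and 15.
    example = ">chain 'CM'\nXPFKRFVEIG\nSLNDXDRFKV\n"
    for header, bad in find_nonstandard(example).items():
        print(header, bad)
```

Reporting 1-based positions makes it easy to compare the output with the residue numbering of the final model.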

jamaliki commented 1 year ago

Hi Sander,

You are right, this should be handled better by the program. For now, you could put a glycine so that the numbering of the end model is what you expect. But, hopefully I will push out a fix soon.

Best, Kiarash.
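The glycine workaround suggested above can be applied to the fasta text with a few lines of Python. This is a sketch only: `replace_unknown` is a hypothetical helper, not a ModelAngelo function, and it assumes that unknown residues appear as `X` or `x` on sequence lines.

```python
def replace_unknown(text, placeholder="G"):
    """Replace X/x on sequence lines with `placeholder`; headers are untouched."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Header lines may legitimately contain the letter X.
            out.append(line)
        else:
            out.append(line.replace("X", placeholder).replace("x", placeholder))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(replace_unknown(">chain 'CM'\nXPFKRFVEIG\nSLNDXDRFKV\n"), end="")
```

Because each X is swapped for exactly one residue, the sequence length and therefore the residue numbering stay the same, which is the reason given for preferring this over deleting the X.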

ColdPopeye commented 7 months ago

> What would be the best way of dealing with these unknown residues, just delete them, replace them with glycines, or something else? Also, it would probably be nice to catch this issue before the start of the C-alpha prediction

I think this is an underrated issue. It doesn't make sense that the input files aren't checked before the prediction starts. Sometimes weird formatting or an error in a sequence can take 20 minutes to show up.

jamaliki commented 7 months ago

@ColdPopeye this should be fixed since v1.0.8, does it still give you an issue?

ColdPopeye commented 7 months ago

> @ColdPopeye this should be fixed since v1.0.8, does it still give you an issue?

I have a slightly sillier problem, and I'm not sure where and when it comes up. I have sequencing results as a Word file, from which I copy and paste the sequences to make a fasta file. Sometimes formatting gets copied or I mis-paste something (I think because Windows and WSL behave strangely). In any case, the error only comes after the C-alpha prediction, which is a bit annoying.
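Copy-paste problems of this kind can usually be spotted in a fraction of a second before launching the build. The sketch below is an assumption about what such a check might look for (a byte-order mark, Windows line endings, sequence data without a header, and characters that are not letters); `check_fasta_text` is a hypothetical helper and says nothing about what ModelAngelo itself validates.

```python
import string

# Letters only; digits, spaces, '*', '-' and non-ASCII characters are flagged.
ALLOWED = set(string.ascii_uppercase)


def check_fasta_text(text):
    """Return a list of human-readable problems found in FASTA text."""
    problems = []
    if text.startswith("\ufeff"):
        problems.append("file starts with a byte-order mark")
        text = text.lstrip("\ufeff")
    if "\r" in text:
        problems.append("file contains Windows (CR) line endings")
    seen_header = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            seen_header = True
            continue
        if not seen_header:
            problems.append(f"line {lineno}: sequence data before any '>' header")
        # Collect each offending character once per line.
        bad = sorted({c for c in line if c.upper() not in ALLOWED})
        if bad:
            problems.append(f"line {lineno}: unexpected characters {bad!r}")
    return problems


if __name__ == "__main__":
    pasted = "\ufeff>seq1\r\nMK V1\r\n"
    for problem in check_fasta_text(pasted):
        print(problem)
```

Note that this check accepts any letter, including X, so it complements rather than replaces a check for non-standard residues.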

3dem / model-angelo

Modelangelo v1.0.1 failing if the fasta contains unknown residues "X" #53