agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. ProtTrans was trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

About 2-char characters in DeepLoc's protein strings #74

Closed ratthachat closed 2 years ago

ratthachat commented 2 years ago

Hi again,

I notice that the DeepLoc dataset given in the Jupyter examples, e.g. https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/ProtBert-BFD-FineTuning-MS.ipynb, contains many "2-char" tokens in the protein strings. In fact, the full set of tokens appearing in the training dataset is:

['A' 'AA' 'AC' 'AD' 'AE' 'AF' 'AG' 'AH' 'AI' 'AK' 'AL' 'AM' 'AN' 'AP' 'AQ'
 'AR' 'AS' 'AT' 'AV' 'AW' 'AY' 'B' 'C' 'CA' 'CC' 'CD' 'CE' 'CF' 'CG' 'CH'
 'CI' 'CK' 'CL' 'CM' 'CN' 'CP' 'CQ' 'CR' 'CS' 'CT' 'CV' 'CW' 'CY' 'D' 'DA'
 'DC' 'DD' 'DE' 'DF' 'DG' 'DH' 'DI' 'DK' 'DL' 'DM' 'DN' 'DP' 'DQ' 'DR'
 'DS' 'DT' 'DV' 'DW' 'DY' 'E' 'EA' 'EC' 'ED' 'EE' 'EF' 'EG' 'EH' 'EI' 'EK'
 'EL' 'EM' 'EN' 'EP' 'EQ' 'ER' 'ES' 'ET' 'EV' 'EW' 'EY' 'F' 'FA' 'FC' 'FD'
 'FE' 'FF' 'FG' 'FH' 'FI' 'FK' 'FL' 'FM' 'FN' 'FP' 'FQ' 'FR' 'FS' 'FT'
 'FV' 'FW' 'FY' 'G' 'GA' 'GC' 'GD' 'GE' 'GF' 'GG' 'GH' 'GI' 'GK' 'GL' 'GM'
 'GN' 'GP' 'GQ' 'GR' 'GS' 'GT' 'GV' 'GW' 'GY' 'H' 'HA' 'HC' 'HD' 'HE' 'HF'
 'HG' 'HH' 'HI' 'HK' 'HL' 'HM' 'HN' 'HP' 'HQ' 'HR' 'HS' 'HT' 'HV' 'HW'
 'HY' 'I' 'IA' 'IC' 'ID' 'IE' 'IF' 'IG' 'IH' 'II' 'IK' 'IL' 'IM' 'IN' 'IP'
 'IQ' 'IR' 'IS' 'IT' 'IV' 'IW' 'IY' 'K' 'KA' 'KC' 'KD' 'KE' 'KF' 'KG' 'KH'
 'KI' 'KK' 'KL' 'KM' 'KN' 'KP' 'KQ' 'KR' 'KS' 'KT' 'KV' 'KW' 'KY' 'L' 'LA'
 'LC' 'LD' 'LE' 'LF' 'LG' 'LH' 'LI' 'LK' 'LL' 'LM' 'LN' 'LP' 'LQ' 'LR'
 'LS' 'LT' 'LV' 'LW' 'LY' 'M' 'MA' 'MC' 'MD' 'ME' 'MF' 'MG' 'MH' 'MI' 'MK'
 'ML' 'MM' 'MN' 'MP' 'MQ' 'MR' 'MS' 'MT' 'MV' 'MW' 'MY' 'N' 'NA' 'NC' 'ND'
 'NE' 'NF' 'NG' 'NH' 'NI' 'NK' 'NL' 'NM' 'NN' 'NP' 'NQ' 'NR' 'NS' 'NT'
 'NV' 'NW' 'NY' 'P' 'PA' 'PC' 'PD' 'PE' 'PF' 'PG' 'PH' 'PI' 'PK' 'PL' 'PM'
 'PN' 'PP' 'PQ' 'PR' 'PS' 'PT' 'PV' 'PW' 'PY' 'Q' 'QA' 'QC' 'QD' 'QE' 'QF'
 'QG' 'QH' 'QI' 'QK' 'QL' 'QM' 'QN' 'QP' 'QQ' 'QR' 'QS' 'QT' 'QV' 'QW'
 'QY' 'R' 'RA' 'RC' 'RD' 'RE' 'RF' 'RG' 'RH' 'RI' 'RK' 'RL' 'RM' 'RN' 'RP'
 'RQ' 'RR' 'RS' 'RT' 'RV' 'RW' 'RY' 'S' 'SA' 'SC' 'SD' 'SE' 'SF' 'SG' 'SH'
 'SI' 'SK' 'SL' 'SM' 'SN' 'SP' 'SQ' 'SR' 'SS' 'ST' 'SV' 'SW' 'SY' 'T' 'TA'
 'TC' 'TD' 'TE' 'TF' 'TG' 'TH' 'TI' 'TK' 'TL' 'TM' 'TN' 'TP' 'TQ' 'TR'
 'TS' 'TT' 'TV' 'TW' 'TY' 'U' 'V' 'VA' 'VC' 'VD' 'VE' 'VF' 'VG' 'VH' 'VI'
 'VK' 'VL' 'VM' 'VN' 'VP' 'VQ' 'VR' 'VS' 'VT' 'VV' 'VW' 'VY' 'W' 'WA' 'WC'
 'WD' 'WE' 'WF' 'WG' 'WH' 'WI' 'WK' 'WL' 'WM' 'WN' 'WP' 'WQ' 'WR' 'WS'
 'WT' 'WV' 'WW' 'WY' 'X' 'Y' 'YA' 'YC' 'YD' 'YE' 'YF' 'YG' 'YH' 'YI' 'YK'
 'YL' 'YM' 'YN' 'YP' 'YQ' 'YR' 'YS' 'YT' 'YV' 'YW' 'YY']

However, ProtBERT's vocab.txt on Hugging Face, for example, contains only single characters, so the tokenizer will tokenize these 2-char tokens as [UNK]. Shouldn't we therefore convert these 2-char tokens into single characters, e.g. 'VW' --> 'V'?
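For illustration, here is a minimal sketch (assuming the Rostlab/prot_bert_bfd checkpoint on Hugging Face) showing how a 2-char token falls back to [UNK]:

from transformers import BertTokenizer

# ProtBERT-BFD's vocab.txt contains only single amino-acid characters,
# so any multi-character "word" cannot be matched and becomes [UNK].
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)

print(tokenizer.tokenize("M K V VW L"))
# expected: ['M', 'K', 'V', '[UNK]', 'L']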

mheinzinger commented 2 years ago

Hi :) good spot. As I was not involved in the creation of those notebooks, I cannot explain how this happened, but you are absolutely right: all our models process solely single characters/amino acids, so any "2-char" combination will be mapped to [UNK], essentially dropping any information from those positions. From a quick glance at the structure of those files, I have created a (now hopefully correct) version which you can download here:

ratthachat commented 2 years ago

Thanks so much Michael!! I will surely find time to play around with this new dataset and report back to you this week!

ratthachat commented 2 years ago

@mheinzinger Michael, thanks for the updated files. Upon quick investigation, I have noticed two aspects that differ from the original version:

1) In deeploc_our_train_set.csv and deeploc_our_val_set.csv, with respect to the membrane column/task, there were previously 3 possible classes {'M', 'S', 'U'}, while the new version has 4 classes {'M', 'Nucleus', 'S', 'U'} -- so the 'Nucleus' class is new.

So there are label changes, but with the same training data. Is this correct? The deeploc_test_set.csv still has only 3 classes, though.

2) In contrast, in setHARD.csv all membrane entries are labeled as 'U'. Presumably, we don't have ground truths for this new set? (A quick check of the labels per file is sketched below.)
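For reference, this is roughly how the membrane labels per split can be inspected (a sketch; the file names are the ones above, the 'membrane' column name is assumed from the notebooks, and local copies of the CSVs are assumed):

import pandas as pd

# Print the set of membrane classes found in each split.
for name in ["deeploc_our_train_set.csv", "deeploc_our_val_set.csv",
             "deeploc_test_set.csv", "setHARD.csv"]:
    df = pd.read_csv(name)
    print(name, sorted(df["membrane"].unique()))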

Thanks so much! Jung (Ratthachat)

mheinzinger commented 2 years ago

Good spot, thanks! - I completely missed that a few proteins are labeled with "Cytoplasm-Nucleus" (not only "Cytoplasm" or "Nucleus"), which messed up my parser. It only affected train/val, as the test set did not contain such labels. I have fixed it now and replaced the files on the server. Simply re-downloading should solve your issue, so that you only end up with "M"/"S"/"U" as possible classes.

During the design of setHARD, we mainly focused on subcellular localization and less on whether those proteins are membrane-bound or not. You could probably still get a proxy by treating proteins in setHARD labeled with "Cell.membrane" as membrane proteins and all others as soluble. However, this shortcut would introduce some false negatives in your ground truth, as it misses all proteins attached to membranes other than the cell membrane (e.g. the nuclear membrane). The best way would be to use the UniProt IDs of the setHARD proteins and check UniProt/SwissProt for any sort of membrane annotation.
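A rough, untested sketch of that last check (the UniProt REST endpoint and JSON field names here are assumptions about the current API, and the accession is only an example):

import requests

def has_membrane_annotation(accession: str) -> bool:
    # Fetch the UniProtKB entry as JSON and look for any membrane-related keyword.
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"
    entry = requests.get(url, timeout=30).json()
    keywords = [kw.get("name", "").lower() for kw in entry.get("keywords", [])]
    return any("membrane" in kw for kw in keywords)

print(has_membrane_annotation("P05067"))  # example accession for a membrane-bound protein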

ratthachat commented 2 years ago

Thanks so much for your kind explanation, Michael! @mheinzinger I have a small question before closing this issue.

In the original dataset posted by @agemagician i.e.

deeplocDatasetTrainUrl = 'https://www.dropbox.com/s/vgdqcl4vzqm9as0/deeploc_per_protein_train.csv?dl=1'
deeplocDatasetValidUrl = 'https://www.dropbox.com/s/jfzuokrym7nflkp/deeploc_per_protein_test.csv?dl=1'

there are only 10 'loc' classes, i.e. there is no "Cytoplasm-Nucleus" class ... (In the new dataset, we now have 11 classes.) So, to be consistent with the experiments in the ProtTrans paper, should we treat this new class as just "Cytoplasm" or just "Nucleus", or do you have another recommendation?

mheinzinger commented 2 years ago

Yes, that is correct; I quickly double-checked, and during the ProtTrans evaluation I dropped proteins with multiple subcellular localizations (this should only affect those with "Cytoplasm-Nucleus", which are relatively few proteins). So if you skip those, you should be able to reproduce the ProtTrans results. In any case, I would highly recommend using the setHARD.fasta/setHARD.csv from Light-Attention (https://github.com/HannesStark/protein-localization) for the final evaluation, as it gives (in our/my opinion) a more robust (but also more conservative) performance estimate.
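A minimal sketch of skipping those proteins (assuming a local copy of the file and the 'loc' column name used in the notebooks):

import pandas as pd

# Drop the few proteins annotated with more than one localization
# ("Cytoplasm-Nucleus") so that 10 'loc' classes remain, as in the paper.
train = pd.read_csv("deeploc_our_train_set.csv")
train = train[train["loc"] != "Cytoplasm-Nucleus"].reset_index(drop=True)
print(sorted(train["loc"].unique()))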

ratthachat commented 2 years ago

Thanks so much Michael. I think everything is clear now. I will study the Light-Attention paper soon. BTW, maybe this dataset issue should be mentioned somewhere in the benchmark section (especially since the original dataset is still used in the Colab examples)? Whether to close this issue now is up to your judgement.

XinshaoAmosWang commented 2 years ago

Hi @ratthachat, @mheinzinger, @agemagician, I noted that the original dataset and notebook example are fine: when reading an amino acid sequence, " ".join("".join(seq.split())) re-splits the sequence into properly space-separated single characters.
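For example (seq here is just a made-up string illustrating the pattern):

# Collapse all whitespace, then re-insert a single space between every character,
# so each amino acid ends up as its own space-separated token.
seq = "M K VW L A"
normalized = " ".join("".join(seq.split()))
print(normalized)  # -> 'M K V W L A'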

I hope this information helps.

mheinzinger commented 2 years ago

Good spot, thanks for the heads-up. I would still recommend switching to the files discussed above, as those will allow a straightforward comparison to the Light-Attention paper.

ratthachat commented 2 years ago

Hi Michael, Xinshao, regarding @XinshaoAmosWang's suggestion, I realize I am confused about the real meaning of these 2-char tokens that appear in the original dataset. So I have a question for @mheinzinger:

Which of the following is the original meaning of 'AA', 'AC', 'AD', 'AE', etc. in the original dataset:

(1) They are typos, and they actually represent two amino acids where 'AA' = 'A A', 'AC' = 'A C', etc. (so the method that Xinshao suggested makes sense), or
(2) they are subtypes of the 'A' token, e.g. 'AA' and 'AC' are mutated variants of 'A' (so the method that Xinshao suggested may not work)?


PS1: Besides this dataset issue, the notebooks in the repo are outdated in the sense that they only work with an old pytorch-lightning version. I modified the notebook on Kaggle so that it always has a P100 GPU, works with the new dataset, and supports the latest Lightning version. If this is useful for the repo, please let me know and I will clean it up and share it here.

PS2: I already read the Light-Attention paper. Thanks for the suggestion!

mheinzinger commented 2 years ago

They were typos; at least I see absolutely no reason to have them. Even if they were mutated variants of "A", ProtBERT never saw them during pre-training, which makes adding them only during fine-tuning questionable. If you have an updated version of the notebook, I would be happy to include it! :)

ratthachat commented 2 years ago

Hi Michael @mheinzinger, thanks. Here is the updated notebook with a ProtBERT Lightning multi-task classifier: https://www.kaggle.com/code/ratthachat/prottrans-lightning-multitasks/notebook

Compared to the original notebook, this notebook

mheinzinger commented 2 years ago

Thanks for sharing @ratthachat! From a quick look, I just realized that the final performance/accuracy on the test set seems to be 11%. If my interpretation is correct, I would assume that something is currently going wrong, as random performance is probably around 11%. Also, I could not access the Weights & Biases results page (I am not even sure whether that is intended on your end, but I thought I could check the evaluation there in more detail).

ratthachat commented 2 years ago

Hi Michael @mheinzinger, thanks, and my apologies! The bug is fixed in version 10 of the notebook, and the WandB stats are now public so everybody can track the decreasing loss.

I trained for just 5 epochs to save time, and the test accuracy is around 80% now. Note that the train accuracy sometimes fluctuates strongly due to batch-size=1. Note also that I log loc-loss and membrane-loss separately in WandB so we can track these metrics separately in the multi-task setting.

I found that with only the 13 GB RAM of a Kaggle notebook, a maximum protein length of 500-700 is possible for a test run (with full precision); longer than that, the notebook can crash during training or testing. Mixed precision is quite difficult to set up within Kaggle, so I skip it for now.

PS. Just to be complete, the previous bug was that the multi-task losses were stored in a native Python list, which PyTorch Lightning could not handle, so the gradient was not propagated back to the other variables.
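A minimal sketch of the fix (the function name is illustrative, not taken from the notebook): the training step has to return a single combined tensor rather than a Python list of losses.

import torch

def combined_loss(loc_loss: torch.Tensor, membrane_loss: torch.Tensor) -> torch.Tensor:
    # Returning one tensor keeps the computation graph intact, so Lightning's
    # backward pass reaches both task heads; the two values can still be
    # logged separately for WandB.
    return loc_loss + membrane_loss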

Jung (Ratthachat)

mheinzinger commented 2 years ago

Hi Jung @ratthachat, that's great news. Would you mind if I added your notebook to the examples that we provide, so that other users might find it more easily? On the memory problem when fine-tuning on long proteins: I am not an expert in that domain, but there seem to be some strategies that allow you to fine-tune large models by offloading certain parameters to CPU, e.g. https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#deepspeed-zero-stage-3-offload
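In pytorch-lightning that roughly amounts to something like this (untested sketch; the strategy string and precision flag depend on your Lightning version and require the deepspeed package to be installed):

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    strategy="deepspeed_stage_3_offload",  # offload optimizer states/params to CPU
    precision=16,                          # mixed precision further reduces memory
)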

Maybe that helps? - Please keep me posted on that as I am also curious whether that works :D

ratthachat commented 2 years ago

Hi Michael! I would be very happy to have the notebook in your repo's examples directory! About the memory issue, I think mixed precision should be the easiest way to go. Previously I could set it up without problems; I will find time to try again, also with the ZeRO-offload technique you mentioned!

Thanks Jung

mheinzinger commented 2 years ago

Completely forgot to come back to you, sorry for that. I added your fine-tuning example to our repo already a month ago (I just forgot to answer you here): https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/protBERT-BFD-lightning-multitasks.ipynb

Thanks a lot again!