Sanofi-Public / CodonBERT

Repository for mRNA Paper and CodonBERT publication.

Inquiry Regarding Vocabulary Size in CodonBERT PyTorch Model #6

Open hjqqqq-bot opened 2 months ago

hjqqqq-bot commented 2 months ago

I would like to express my gratitude for your excellent work on CodonBERT. I have been thoroughly impressed by your research and the accompanying code.

However, I have encountered a discrepancy that I would like to clarify. In your paper and code, the vocabulary size is given as 5 × 5 × 5 + 5 = 130, based on the characters 'A', 'U', 'G', 'C', and 'N'. Yet, in the CodonBERT PyTorch model you provided, the vocabulary size is set to 69.

Could you please explain the rationale behind this difference in vocabulary size? Understanding this would greatly help me in comprehending and utilizing your model more effectively.

Thank you in advance for your assistance.

a253324 commented 2 months ago

I also noticed this problem. Is this a mistake in the model the authors provided? Because of it, in order to run the fine-tuning code successfully I had to modify the model code: I changed the 69 to 130, kept the original model's weights, and filled the remaining embedding rows with zeros or random normal values. But this operation may affect the model's performance. If this is a mistake, I hope the authors can provide the correct CodonBERT model code. I would appreciate it.
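For reference, here is a minimal sketch of the kind of resize described above, assuming the checkpoint's token embedding is a plain torch.nn.Embedding; the expand_embedding helper and the zero initialization of the new rows are illustrative choices, not code from this repository.

```python
import torch
import torch.nn as nn

def expand_embedding(old_emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Grow a token embedding table, keeping the pretrained rows and zero-filling the rest."""
    new_emb = nn.Embedding(new_vocab_size, old_emb.embedding_dim,
                           padding_idx=old_emb.padding_idx)
    with torch.no_grad():
        new_emb.weight.zero_()                                      # new rows start at zero
        new_emb.weight[: old_emb.num_embeddings] = old_emb.weight   # copy the 69 pretrained rows
    return new_emb
```

Whether zero or random-normal initialization works better for the extra rows would have to be checked empirically, as noted above.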

whql251 commented 1 month ago

I noticed that there are 64 different codons and 5 special tokens, adding up to a vocabulary size of 69. However, the order of the codons in the vocabulary table is still unclear to me. Could you kindly provide the correct vocabulary table or clarify the order of the codons? This information would be very helpful.

a253324 commented 1 month ago

> I noticed that there are 64 different codons and 5 special tokens, adding up to a vocabulary size of 69. However, the order of the codons in the vocabulary table is still unclear. Could you kindly provide the correct vocabulary table or clarify the order of the codons?

In pretrain.py and finetune.py, after data processing, the variable dic_voc stores the vocabulary table, and its contents are: {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4, 'AAA': 5, 'AAU': 6, 'AAG': 7, 'AAC': 8, 'AAN': 9, 'AUA': 10, 'AUU': 11, 'AUG': 12, 'AUC': 13, 'AUN': 14, 'AGA': 15, 'AGU': 16, 'AGG': 17, 'AGC': 18, 'AGN': 19, 'ACA': 20, 'ACU': 21, 'ACG': 22, 'ACC': 23, 'ACN': 24, 'ANA': 25, 'ANU': 26, 'ANG': 27, 'ANC': 28, 'ANN': 29, 'UAA': 30, 'UAU': 31, 'UAG': 32, 'UAC': 33, 'UAN': 34, 'UUA': 35, 'UUU': 36, 'UUG': 37, 'UUC': 38, 'UUN': 39, 'UGA': 40, 'UGU': 41, 'UGG': 42, 'UGC': 43, 'UGN': 44, 'UCA': 45, 'UCU': 46, 'UCG': 47, 'UCC': 48, 'UCN': 49, 'UNA': 50, 'UNU': 51, 'UNG': 52, 'UNC': 53, 'UNN': 54, 'GAA': 55, 'GAU': 56, 'GAG': 57, 'GAC': 58, 'GAN': 59, 'GUA': 60, 'GUU': 61, 'GUG': 62, 'GUC': 63, 'GUN': 64, 'GGA': 65, 'GGU': 66, 'GGG': 67, 'GGC': 68, 'GGN': 69, 'GCA': 70, 'GCU': 71, 'GCG': 72, 'GCC': 73, 'GCN': 74, 'GNA': 75, 'GNU': 76, 'GNG': 77, 'GNC': 78, 'GNN': 79, 'CAA': 80, 'CAU': 81, 'CAG': 82, 'CAC': 83, 'CAN': 84, 'CUA': 85, 'CUU': 86, 'CUG': 87, 'CUC': 88, 'CUN': 89, 'CGA': 90, 'CGU': 91, 'CGG': 92, 'CGC': 93, 'CGN': 94, 'CCA': 95, 'CCU': 96, 'CCG': 97, 'CCC': 98, 'CCN': 99, 'CNA': 100, 'CNU': 101, 'CNG': 102, 'CNC': 103, 'CNN': 104, 'NAA': 105, 'NAU': 106, 'NAG': 107, 'NAC': 108, 'NAN': 109, 'NUA': 110, 'NUU': 111, 'NUG': 112, 'NUC': 113, 'NUN': 114, 'NGA': 115, 'NGU': 116, 'NGG': 117, 'NGC': 118, 'NGN': 119, 'NCA': 120, 'NCU': 121, 'NCG': 122, 'NCC': 123, 'NCN': 124, 'NNA': 125, 'NNU': 126, 'NNG': 127, 'NNC': 128, 'NNN': 129}

That is 130 tokens in total (indices 0–129), but the model's config says 69.
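For anyone following along, here is a short reconstruction of how a table with exactly this ordering can be produced; it is an assumption based on the ordering above and on the lst_ele variable mentioned later in this thread, not the literal code from pretrain.py.

```python
from itertools import product

SPECIAL_TOKENS = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
lst_ele = list('AUGCN')  # the character list referenced later in this thread

# All 5**3 = 125 triplets, enumerated in the same A/U/G/C/N order as the table above
codons = [''.join(p) for p in product(lst_ele, repeat=3)]
dic_voc = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + codons)}

print(len(dic_voc))    # 130
print(dic_voc['AAA'])  # 5, matching the table above
```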

whql251 commented 1 month ago

> That is 130 tokens in total (indices 0–129), but the model's config says 69.

To my knowledge, there are only 64 valid codons. The codons and their corresponding amino acids are listed below: [['CCT', 'P'], ['TCC', 'S'], ['GGA', 'G'], ['TAA', 'X'], ['GAG', 'E'], ['TGG', 'W'], ['AAG', 'K'], ['CTG', 'L'], ['AGA', 'R'], ['TTA', 'L'], ['GTG', 'V'], ['CAG', 'Q'], ['ATG', 'M'], ['CGA', 'R'], ['ACT', 'T'], ['GCT', 'A'], ['TCT', 'S'], ['CCC', 'P'], ['CGG', 'R'], ['ATA', 'I'], ['CAA', 'Q'], ['GTA', 'V'], ['TTG', 'L'], ['AGG', 'R'], ['CTA', 'L'], ['AAA', 'K'], ['TGA', 'X'], ['GAA', 'E'], ['TAG', 'X'], ['GGG', 'G'], ['GCC', 'A'], ['ACC', 'T'], ['ACA', 'T'], ['GCA', 'A'], ['TCG', 'S'], ['GAC', 'D'], ['TGC', 'C'], ['TTT', 'F'], ['CGT', 'R'], ['AAC', 'N'], ['CTC', 'L'], ['GTC', 'V'], ['GGT', 'G'], ['TAT', 'Y'], ['AGT', 'S'], ['CAC', 'H'], ['ATC', 'I'], ['CCA', 'P'], ['CCG', 'P'], ['CTT', 'L'], ['AAT', 'N'], ['CGC', 'R'], ['TTC', 'F'], ['TGT', 'C'], ['GAT', 'D'], ['ATT', 'I'], ['CAT', 'H'], ['AGC', 'S'], ['TAC', 'Y'], ['GGC', 'G'], ['GTT', 'V'], ['TCA', 'S'], ['GCG', 'A'], ['ACG', 'T']]. This may explain the vocabulary size (69) in the model config.
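As a quick sanity check (not code from the repository), the 64 codons listed above are exactly the 4³ triplets over the four standard DNA bases, which together with the 5 special tokens gives 69:

```python
from itertools import product

# 4**3 = 64 codons over the four standard bases, plus 5 special tokens
all_codons = {''.join(p) for p in product('ACGT', repeat=3)}
print(len(all_codons))      # 64
print(len(all_codons) + 5)  # 69, matching the model config
```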

a253324 commented 1 month ago

> To my knowledge, there are only 64 valid codons. [...] This may explain the vocabulary size (69) in the model config.

Yeah, that's correct. Thank you for the reply. After my last reply I immediately checked the code again, and I think I see why the difference appears. In finetune.py and pretrain.py, the variable lst_ele (line 70 in pretrain.py) is list('AUGCN'), so after processing there are 130 vocabulary entries. When I change the list to list('AUGC'), there are 69 entries, which is consistent with the model's config. Given this, another question appears: the model config the authors provided says 69, so does that mean they pretrained the model with the list('AUGC') setting rather than list('AUGCN')? If so, the released model would not be consistent with the paper, which says they actually pretrained with the list('AUGCN') setting.
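To make the comparison concrete, here is a minimal sketch of the two settings; vocab_size is a hypothetical helper that only mirrors the counting logic described above, not a function from the repository.

```python
SPECIAL_TOKENS = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']

def vocab_size(alphabet: str) -> int:
    """Special tokens plus all triplets over the given alphabet (illustrative only)."""
    return len(SPECIAL_TOKENS) + len(alphabet) ** 3

print(vocab_size('AUGCN'))  # 130, what the current pretrain.py/finetune.py build
print(vocab_size('AUGC'))   # 69, what the released model config expects
```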