THUDM / ProteinLM

Protein Language Model
Apache License 2.0
112 stars, 20 forks

format of the sequence json file, which one? #8

Closed: usccolumbia closed this issue 2 years ago

usccolumbia commented 2 years ago

Which format should the sequence JSON file use? Do we need to add spaces between amino acids?

In https://github.com/THUDM/ProteinLM/tree/main/pretrain:

{"text": "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"}
{"text": "RTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR"}

In https://github.com/THUDM/ProteinLM/tree/main/pretrain/protein_tools:

{"text": "G C T V E D R C L I G M G A I L L N G C V I G S G S L V A A G A L I T Q "}
{"text": "A D G I N L E I P R G E W I S V I G G N G S G K S T F L K S L I R L E A V K K G R I Y L E G R E L K K W S D R T L Y E K A G F V F Q N P E L Q F I R D T V F D E I A F G A R Q R S W P E E Q V E R K T A E L L Q E F G L D G H Q K A H P F T L S L G Q K R R L S V A T M L L F D Q D L L L L D E P T F "}

Yijia-Xiao commented 2 years ago

Hi @usccolumbia,

The second one (amino acids separated by spaces) is correct. I have fixed it in #11.

Thanks for your issue!


Best, Yijia
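
For anyone who already has files in the unspaced format, here is a minimal sketch of converting them to the space-separated format confirmed above. It is not part of the ProteinLM repository: the function names `to_spaced_record` and `convert_lines` are hypothetical, and it assumes one JSON object per line with a single "text" key, as in the examples in this thread. The examples above end each sequence with a trailing space; the thread does not say whether that is required, so this sketch omits it.

```python
import json


def to_spaced_record(sequence: str) -> str:
    """Return one JSON line whose "text" has residues separated by single spaces."""
    # Drop any whitespace already present so spaced input is not double-spaced.
    residues = "".join(sequence.split())
    return json.dumps({"text": " ".join(residues)})


def convert_lines(lines):
    """Convert an iterable of JSON lines ({"text": ...}) to the spaced format."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield to_spaced_record(record["text"])


if __name__ == "__main__":
    # Sequence taken from the first example in this thread.
    unspaced = ['{"text": "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"}']
    for out in convert_lines(unspaced):
        print(out)
```

If the trailing space turns out to matter for the tokenizer, appending `" "` to the joined string inside `to_spaced_record` would reproduce the examples exactly.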