geohot / corona

Reverse engineering SARS-CoV-2

Using Natural Language Transformers for Classification #7

Open trisongz opened 4 years ago

trisongz commented 4 years ago

Glad I stumbled upon this project - was working on a theory using the same base dataset.

Since proteins/genes are essentially sequences of letters, it led me to the idea of using Transformer models like BERT to classify sequences by their structure. If that theory held, I'd want to try a multi-task approach that pairs the valid treatment sequence with the virus sequence, and look at whether the model can predict the treatment sequence given the input virus sequence.
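For illustration only, here's a rough sketch of what I mean by treating a sequence as text - splitting it into overlapping k-mers that a BERT-style tokenizer could consume (k=6 is an arbitrary choice, not something I've settled on):

# Rough sketch: frame a genome/protein string as "text" by splitting it into
# overlapping k-mers, which a BERT-style classifier could then consume.
def to_kmers(seq, k=6):
    """Split a nucleotide/protein string into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(to_kmers("ATGGAGAGAATAAAAGAACTGAGAGAT")[:5])
# ['ATGGAG', 'TGGAGA', 'GGAGAG', 'GAGAGA', 'AGAGAA']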

I haven't studied the structure as much as you guys probably have - so I'd defer to you on whether this would be plausible/feasible given what we know so far.

Here are a few other starting points I've looked at:

ReSimNet: Drug Response Similarity Prediction using Siamese Neural Networks (Jeon and Park et al., 2018)

https://github.com/dmis-lab/ReSimNet

BERN is a BioBERT-based multi-type NER tool that also supports normalization of extracted entities.

https://github.com/dmis-lab/bern

geohot commented 4 years ago

Hmm, so I don't know what you mean by "treatment sequence." Usually, I've seen these transformer models trained as big unsupervised predictors of the next character.

trisongz commented 4 years ago

The idea would be to model it after something like the SQuAD/SWAG datasets for Question Answering, where you typically have a large body of text as the initial context (the virus sequence), followed by the answer and the positions of the spans for that answer, if found in the text (the vaccine/cure sequence).
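For concreteness, a record in that style might look something like this (the field names follow SQuAD; the sequences and the offset below are made up for illustration):

# Hypothetical SQuAD-style record: the virus sequence plays the role of the
# context paragraph and the treatment-relevant subsequence is the answer span.
example = {
    "context": "ATGGAGAGAATAAAAGAACTGAGAGAT",  # virus sequence (truncated)
    "qas": [{
        "id": "cov-0001",
        "question": "Which region is targeted by the candidate treatment?",
        "answers": [{
            "text": "GAACTGAGAGAT",   # answer span found in the context
            "answer_start": 15,       # character offset of that span
        }],
    }],
}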

Example of a BioBERT dataset formatted for SQuAD: https://storage.googleapis.com/ce-covid-public/BioASQ-6b/train/Full-Abstract/BioASQ-train-factoid-6b-full-annotated.json

Additional dataset from BioASQ: https://storage.cloud.google.com/ce-covid-public/2ndYearDatasetTask2b.json

I also compiled additional sequence data which may or may not overlap with the download script you had.

https://drive.google.com/drive/folders/18aAuP3OhGMLKV8jZpt_8vpLY5JSqOS9E?usp=sharing

There are 3 sets - Coronaviruses, Influenzaviruses, and SARS-related. The jsonl files are the raw data, compiled by filtering for complete sequences and the virus families; the accession codes were then used to download the sequences into the json files, so they should match the same format as your allseq.json file.
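A minimal sketch of that download step (assuming NCBI's efetch endpoint; the actual script may differ, and the field/file names here are placeholders):

import json
import urllib.request

# Minimal sketch: read accession codes from a filtered .jsonl file and fetch
# each sequence from NCBI's efetch endpoint, writing the result as json.
EFETCH = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
          "?db=nuccore&id={acc}&rettype=fasta&retmode=text")

def fetch_sequence(accession):
    with urllib.request.urlopen(EFETCH.format(acc=accession)) as resp:
        fasta = resp.read().decode()
    return "".join(fasta.splitlines()[1:])  # drop the FASTA header line

records = []
with open("coronaviruses.jsonl") as f:          # placeholder file name
    for line in f:
        meta = json.loads(line)
        acc = meta["accession"]                 # placeholder field name
        records.append({"accession": acc, "sequence": fetch_sequence(acc)})

with open("covseq.json", "w") as f:
    json.dump(records, f)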

amoux commented 4 years ago

@trisongz I downloaded the files and put something together. Let me know if it's similar to what you were suggesting. By the way, I am familiar with the transformers library, and I don't think you can use the pre-trained language models' vocabularies for these types of sequences. Anyways, here's the Colab link of what I put together - let me know if it's related!

Colab-Notebook

trisongz commented 4 years ago

@amoux That's pretty awesome! I hadn't thought of using a node graph, mainly because I don't work with them as often as I'd like to.

So I've been messing around with different methods, and out of the box transformers won't necessarily work. You pointed out the first issue, which is creating the vocabulary. There wasn't a single number that every sequence length was divisible by, so what I did instead was process each sequence to find the lowest prime that divides its length, and split the sequence into chunks of that prime length.
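In code, the splitting step is roughly this (a minimal sketch of what I described; "lowest prime" here means the smallest prime factor of the sequence length):

# Find the smallest prime factor of the sequence length and cut the sequence
# into chunks of that size.
def smallest_prime_factor(n):
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n  # n itself is prime (or 1)

def split_by_prime(seq):
    p = smallest_prime_factor(len(seq))
    return [seq[i:i + p] for i in range(0, len(seq), p)]

chunks = split_by_prime("ATGGAGAGAATAAAAGAACTGAGAGATCTA")  # length 30 -> prime 2
print(len(chunks), chunks[:4])  # 15 ['AT', 'GG', 'AG', 'AG']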

## working file - covseq.json

Total Non-Unique Primes: 8297

Total Unique Primes: 1998

Unique Primes: [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31,
35, 37, 41, 43, 45, 47, 49, 50, 53, 54, 57, 59, 61, 62, 63, 67, 71, 73, 
77, 79, 83, 85, 89, 91, 95, 97, 100, 101, 103, 106, 107, 108, 109, 113, 
115, 119, 121, 123, 124, 125, 126, 127, 129, 131, 133, 135, 137, 139, 
143, 145, 149, 151, 155, 157, 161, 163, 167, 171, 173, 175, 179, 181, 
183, 187, 189, 191, 193, 197, 199, 200, 201, 203, 205, 209, 211..]

Afterwards, I compiled all the split sequence chunks into a single list and deduplicated it, leaving only the unique sequence chunks.
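Something like this (a sketch; all_split_sequences stands in for the per-sequence chunk lists from the step above):

# Pool the chunks from every file, then dedupe while keeping a stable order.
complete_tokens = []
seen = set()
for chunks in all_split_sequences:   # placeholder: per-sequence chunk lists
    for tok in chunks:
        if tok not in seen:
            seen.add(tok)
            complete_tokens.append(tok)
print("Total Unique Tokens:", len(complete_tokens))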

fluseq.json has 251607 tokens

covseq.json has 215855 tokens

sarseq.json has 96971 tokens

Total Non-Unique Tokens: 564433

Total Unique Tokens: 208565

ATGGAGAGAATAAAAGAACTGAGAGATCTAATGTCGCAGTCCCGCACTCGCGAGATACTCACTAAGACCACTGTGGACCATATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCCGCACTCAGAATGAAGTGGATGATGGCAATGAGATACCCAATTACAGCAGACAAGAGAATAATGGACATGATTCCAGAGAGGAATGAACAAGGACAAACCCTCTGGAGCAAAACAAACGATGCTGGATCAGACCGAGTGATGGTATCACCTCTGGCCGTAACATGGTGGAATAGGAATGGCCCAACAACAAGTACAGTTCATTACCCTAAGGTATATAAAACTTATTTCGAAAAGGTCGAAAGGTTGAAACATGGTACCTTCGGCCCTGTCCACTTCAGAAATCAAGTTAAAATAAGGAGGAGAGTTGATACAAACCCTGGCCATGCAGATCTCAGTGCCAAGGAGGCACAGGATGTGATTATGGAAGTTGTTTTCCCAAATGAAGTGGGGGCAAGAATACTGACATCAGAGTCACAGCTGGCAATAACAAAAGAGAAGAAAGAAGAGCTCCAGGATTGTAAAATTGCTCCCTTGATGGTGGCGTACATGCTAGAAAGAGAATTGGTCCGTAAAACAAGGTTTCTCCCAGTAGCCGGCGGAACAGGCAGTGTTTATATTGAAGTGTTGCACTTAACCCAAGGGACGTGCTGGGAGCAGATGTACACTCCAGGAGGAGAAGTGAGAAATGATGATGTTGACCAAAGTTTGATTATCGCTGCTAGAAACATAGTAAGAAGAGCAGCAGTGTCAGCAGACCCATTAGCATCTCTCTTGGAAATGTGCCACAGCACACAGATTGGAGGAGTAAGGATGGTGGACATCCTTAGACAGAATCCAACTGAGGAACAAGCCGTAGACATATGCAAGGCAGCAATAGGGTTGAGGATTAGCTCATCTTTCAGTTTTGGTGGGTTCACTTTCAAAAGGACAAGCGGATCATCAGTCAAGAAAGAAGAAGAAGTGCTAACGGGCAACCTCCAAACACTGAAAATAAGAGTACATGAAGGGTATGAAGAATTCACAATGGTTGGGAGAAGAGCAACAGCTATTCTCAGAAAGGCAACCAGGAGA

Still a massive vocab for most models, so I tried using XLNet. (The values here are a bit messed up - I realized I had counted 1 as a prime, as seen in the list above, which led to a much smaller size.)

import torch
from transformers import XLNetTokenizer, XLNetModel

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetModel.from_pretrained('xlnet-base-cased')

num_added_toks = tokenizer.add_tokens(complete_tokens)  # complete_tokens: list of the deduplicated tokens
print('We have added', num_added_toks, 'tokens')
model.resize_token_embeddings(len(tokenizer))

>> We have added 65134 tokens
>> Embedding(97134, 768)

This is where I'm currently at. My first goal is to attempt Sequence Classification/Entailment, but I'm stuck on how to pre-process the data into the correct format for that task.
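One framing I'm considering (a rough sketch, not a working pipeline - the pair of chunks and the label below are made up): encode a (virus chunk, candidate treatment chunk) pair and classify whether they belong together.

import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
clf = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=2)
# (then add_tokens / resize_token_embeddings as in the snippet above)

virus_chunk = "ATGGAGAGAATAAAAGAACTGAGAGAT"     # placeholder "premise"
treatment_chunk = "GGCAATAACAAAAGAGAAGAAAGAA"   # placeholder "hypothesis"

inputs = tokenizer.encode_plus(virus_chunk, treatment_chunk, return_tensors='pt')
outputs = clf(**inputs, labels=torch.tensor([1]))  # 1 = "valid pairing" (made-up label)
loss, logits = outputs[0], outputs[1]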

Also - I realized that the flu dataset is a lot smaller than it should be, so I'll reupload the updated version in the folder soon.