How to run the code with FASTA sequence data

smruti241 commented 1 year ago

Hi @Moeinh77, I tried to run the code with my data, but it gives errors such as "num classes is not defined" and "sequences is not defined". Can you please tell me exactly how to use it with my data? The data are in the form of FASTA sequences.

smruti241 commented 1 year ago

Hi @Moeinh77, I tried with my data, and it gave the following error:

(dnabert) smrutip@iiitd:~/Virus-DNA-classification-BERT$ python test_data.py
Some weights of the model checkpoint at zhihan1996/DNA_bert_6 were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at zhihan1996/DNA_bert_6 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "test_data.py", line 16, in <module>
    train_dataset = HF_dataset(train_encodings["input_ids"], train_encodings["attention_mask"], y_train)
NameError: name 'y_train' is not defined

In this case, y_train is not defined anywhere. Can you please explain the whole workflow, or share the full code? Thank you.

Moeinh77 commented 1 year ago

Hi, my main.py file walks through the process of reading the data, so have a look at that. The idea is to convert your sequences into k-mer form and feed that into the model. The test_data.py file you are talking about is not from this repository, so unfortunately I don't know how to help you with it. But my main.py file shows how to train and test the model.
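
For readers following along, the k-mer conversion mentioned above can be sketched as follows. This is a minimal illustration, not the code from main.py: the function name `seq_to_kmers` is hypothetical, and it assumes overlapping k-mers with a stride of 1 and k = 6 to match the `zhihan1996/DNA_bert_6` checkpoint named in the log above.

```python
def seq_to_kmers(sequence: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers joined by spaces.

    Hypothetical helper; k=6 is an assumption chosen to match DNA_bert_6.
    Sequences shorter than k produce an empty string.
    """
    sequence = sequence.upper()
    # Slide a window of width k over the sequence, one base at a time.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return " ".join(kmers)


if __name__ == "__main__":
    # The example sequence from this thread.
    print(seq_to_kmers("ATGCATGCATGACA"))
```

The resulting space-separated string is what would typically be passed to the tokenizer in place of the raw sequence.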

smruti241 commented 1 year ago

OK, thanks @Moeinh77. I will check that and get back to you.

smruti241 commented 1 year ago

@Moeinh77 can you please tell me what "data/Trainingdata.csv", "data/TestData/Testdata-2.csv" and "data/TestData/" are in the main.py file? How do I make train and test data from my main data?

Moeinh77 commented 1 year ago

They are the CSV files from the research paper that I got the data from. If you would like to see those files, go to this URL and download their data: http://www.nitttrkol.ac.in/indrajit/projects/COVID-DeepPredictor/
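
As a rough sketch of how train and test sets could be made from your own FASTA data, the snippet below parses FASTA text and shuffles the records into two lists. It does not reproduce the layout of the CSV files above; `parse_fasta`, `split_records`, the 80/20 ratio and the fixed seed are all assumptions made for illustration.

```python
import random


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle records and return (train, test) lists (hypothetical helper)."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_test = int(len(records) * test_fraction)
    return records[n_test:], records[:n_test]


if __name__ == "__main__":
    example = ">seq1\nATGCATGC\nATGACA\n>seq2\nTTGACCA\n>seq3\nGGCATA\n"
    train, test = split_records(parse_fasta(example), test_fraction=0.34)
    print(len(train), "train records,", len(test), "test records")
```

To read a real file, pass its contents, for example `parse_fasta(open(path).read())`, where `path` is your own FASTA file.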

Best regards, Moein Hasani Bioinformatics Lab, University of Saskatchewan Connect with me : My LinkedIn https://www.linkedin.com/in/moein-hasani/ My GitHub https://github.com/Moeinh77


smruti241 commented 1 year ago

@Moeinh77 The data you ran the code on contains PID, class, classnumber and sequence columns, and the code gets its labels from them. I don't have anything except the sequences (e.g. ATGCATGCATGACA). Can you please tell me how I can modify the code to work with sequences only, and how to fine-tune the model without labels? Thank you.

Moeinh77 commented 1 year ago

Hi, yes, the code in my repository is meant for classification, not pre-training. If you have a very large dataset (larger than the human genome that DNABERT uses), go ahead and use it to pre-train the model. If not, there is no need to pre-train the network again. It is already trained on a large dataset, so just use it for a downstream task such as sequence classification or binding site detection.
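
Since classification needs a label for every sequence, one way to obtain something like the `y_train` list from the earlier traceback is to map class names to integers. The sketch below is only an illustration under assumptions: `encode_labels` is a hypothetical helper, and the class names are made up. How you assign a class to each sequence (for example, one FASTA file per class) depends on your own data.

```python
def encode_labels(class_names):
    """Map class names to integer ids, returning (labels, mapping).

    Hypothetical helper: ids are assigned in sorted order of the class names.
    """
    classes = sorted(set(class_names))
    class_to_id = {name: idx for idx, name in enumerate(classes)}
    return [class_to_id[name] for name in class_names], class_to_id


if __name__ == "__main__":
    # Made-up class names, one per training sequence, in the same order.
    train_classes = ["classB", "classA", "classB", "classC"]
    y_train, mapping = encode_labels(train_classes)
    print(y_train)       # integer labels aligned with the sequences
    print(len(mapping))  # number of classes for the classification head
```

The number of distinct classes is the value a sequence classification model would need for its output layer.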

Moeinh77 / Virus-DNA-classification-BERT

how to run the code with fasta sequence data #3