egochao / DeePromoter

PyTorch implementation of DeePromoter: active sequence detection for promoters (DNA subsequences that regulate transcription initiation of a gene by controlling the binding of RNA polymerase).

Visual evaluation of models using ROC, and other questions #3

Open wq-ls opened 5 months ago

wq-ls commented 5 months ago

Hi, @egochao

It's very exciting that you provide a simple method to train on and predict promoter sequences; it is very friendly to newbies in machine learning like us. I encountered some problems during use and would like to ask you for some advice.

  1. Your model is trained on the promoter sequences of a single species at a time. Does it make sense to combine the promoter data of several species for training?

For example, I combined promoter sequences identified in more than 60 species of fish and trained a model on them. I want to use this trained model to predict promoter regions of other fish species; is this possible?

The final prediction results:

- fish-TATA (46,800 sequences): test precision 0.9282, test recall 0.8752, test MCC 0.8088
- fish-nonTATA (77,245 sequences): test precision 0.9343, test recall 0.7894, test MCC 0.7428

This is only the result of a single training run. Compared with the training results of the mouse and human models, it does not look very good. I am also trying to increase the number of training runs.

Can you give me some suggestions on other parameters that need to be adjusted? For example, you write "I used the final set of parameters from the paper. Kernel size = [27, 14, 7], and max pooling with kernel = 6." I'm a little confused about whether I should adjust these for my data, and if so, how? (See my rough sketch of these layers after this list.)

  2. In fact, I also tried using the zebrafish promoter sequences from the EPDnew database, but there were too few TATA-box sequences (700 sequences), so the program failed to train the model and reported an error. This is why I want to merge promoter sequences from more fish species.

But this also raises another issue: the dataset I constructed myself may contain inaccurate promoter sequences, for example, sequences that are not real TATA-box promoters, and I still put them into the training set.

After all, the data is not as reliable as data from the EPDnew database, but I estimate that about 90% of the sequences are reasonably accurate.

Will this have a big impact on the final model training? How should I assess it? Considering that there are errors in the dataset itself, what should the final precision, recall, and MCC reach for the model to be considered suitable?

  3. I wanted to output ROC curve plots by modifying your test.py, but I failed. Can you give me some suggestions on the main modifications needed? I'm not quite sure how I should structure the test dataset, such as how to label and classify it.
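For reference, this is roughly how I understand the quoted parameters map onto PyTorch layers: three parallel Conv1d branches with kernel sizes 27, 14 and 7, each followed by max pooling with kernel 6. This is only my own simplified sketch (the channel counts, 300 bp input length and layer names are my assumptions, not your actual code); please correct me if I have misread it.

```python
import torch
import torch.nn as nn

# Simplified sketch of how I read "Kernel size = [27, 14, 7], max pooling with kernel = 6":
# three parallel Conv1d branches over a one-hot encoded sequence (batch, 4, 300),
# each followed by ReLU and max pooling, with the feature maps concatenated.
class ParallelConvSketch(nn.Module):
    def __init__(self, in_channels=4, out_channels=32,
                 kernel_sizes=(27, 14, 7), pool_kernel=6):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, out_channels, k, padding="same"),
                nn.ReLU(),
                nn.MaxPool1d(pool_kernel),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # concatenate the pooled feature maps from all branches along the channel dim
        return torch.cat([branch(x) for branch in self.branches], dim=1)


if __name__ == "__main__":
    model = ParallelConvSketch()
    dummy = torch.randn(8, 4, 300)   # batch of 8 one-hot encoded 300 bp sequences
    print(model(dummy).shape)        # torch.Size([8, 96, 50])
```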

Best wishes,

Shuo Li

egochao commented 4 months ago

Hi @wq-ls, sorry for the late reply.

I have to say that I don't have enough expertise in this domain to answer your questions here. This repo is my best effort to replicate the paper's results.

I would recommend cleaning and verifying your dataset first, so the labels are as accurate as possible before you train and compare models.
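For the ROC curve, here is a rough, untested sketch of the kind of helper I would add around the test loop. It assumes you can get a trained model and a test DataLoader that yields (sequence, label) batches, and that the model has a single sigmoid output where 1 = promoter; the function and variable names here are placeholders, not the actual API of test.py, so adapt them to the real code.

```python
import matplotlib.pyplot as plt
import torch
from sklearn.metrics import auc, roc_curve


def plot_roc(model, test_loader, out_path="roc_curve.png", device="cpu"):
    """Collect sigmoid scores and true labels over the test set, then plot an ROC curve."""
    model.eval()
    scores, labels = [], []
    with torch.no_grad():
        for seqs, targets in test_loader:
            logits = model(seqs.to(device))
            probs = torch.sigmoid(logits).squeeze(-1)  # predicted probability of "promoter"
            scores.extend(probs.cpu().tolist())
            labels.extend(targets.tolist())           # true labels, expected as 0/1

    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.savefig(out_path, dpi=150)
```

scikit-learn's roc_curve only needs the continuous scores and the binary labels, so as long as you can pull those two lists out of the existing test loop, the plotting part is independent of how the dataset is built.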

wq-ls commented 4 months ago

I strongly agree with your suggestion, because I did encounter a lot of problems when I continued with the downstream analysis after predicting results with my current approach.

I should reprocess my dataset to make sure it's accurate.

Finally, thank you very much for your reply.