WGLab / mutformer

A transformer model to predict pathogenic mutations
Apache License 2.0

How can the five mentioned test sets in the article be obtained? #1

Open yang-arina opened 10 months ago

yang-arina commented 10 months ago

Hello! The paper (https://www.sciencedirect.com/science/article/pii/S2666675823001157) states: "To assess the performance of MutFormer against existing methods of deleteriousness prediction, a total of five [testing datasets] were used." They are shown in "Table 2. Details for each testing dataset". However, I couldn't find these five test sets in the supplementary materials of the article or on GitHub. I would be very grateful if you could provide them to me!

tianqitheodorejiang commented 10 months ago

Hi! You are correct that the testing datasets mentioned in the paper are not publicly available. This is because I think some of the datasets may require licenses and prohibit redistribution (though I am not positive about this). I can however point you to exactly where to download each dataset (note that during MutFormer's experiments, we filtered all testing datasets by removing data points present in any pretraining or finetuning data; if you are interested in the filtered dataset versions, I can take a closer look at the licensing situation again and if possible provide you with these sets):

Hope this helps!

-Theo

yang-arina commented 10 months ago

Thank you for your patient reply! I am indeed very interested in your filtered dataset versions. I would be extremely grateful if it were convenient for you to provide me with these filtered datasets!

yang-arina commented 10 months ago

Hello! I sincerely apologize for reaching out again. I would like to inquire if you could provide me with the filtered dataset regarding the above-mentioned issue. This is crucial for our research, and I would greatly appreciate it if you could respond at your earliest convenience. Thank you very much.

tianqitheodorejiang commented 10 months ago

Hi! So sorry for the late reply; it does appear that since all the data sources are open-source, it should be okay. Here are the filtered datasets (please let me know if you have any more questions!): test_data.zip

yang-arina commented 10 months ago

Hi! I sincerely appreciate your patience and professionalism. The data you provided will be immensely helpful to us! I have downloaded and reviewed the test dataset, and I noticed that there are 5 columns. According to my understanding, these columns should be labeled as follows: label, reference sequence, mutant sequence, external data, and mutation position. This aligns perfectly with the format required for MutFormer input files. However, I am eager to know whether the mutations include an identifier similar to 'Q96NU1_P10S,' encompassing the UniProt canonical accession, reference residue, mutation position, and mutation residue information. This unique ID would be instrumental for our cross-referencing efforts. Your assistance in providing this information would be greatly appreciated! Thank you once again for your invaluable support.
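Until an accession is available, the residue-level part of such an identifier (the 'P10S' portion) can in principle be recovered from the reference and mutant sequence columns. The following is only a minimal sketch under assumptions: it assumes each row describes a single-residue substitution, that the two sequences have equal length, and that any spaces between residues can simply be stripped. `describe_substitution` is a hypothetical helper, not part of MutFormer, and the position it reports is relative to the sequence fragment in the file, not necessarily to the full protein.

```python
def describe_substitution(ref_seq: str, mut_seq: str) -> str:
    """Return a descriptor such as 'P11S' for a single-residue substitution.

    The position is 1-based and counted within the given fragment.
    """
    # Assumption: residues may be space-separated; drop the spaces.
    ref = ref_seq.replace(" ", "")
    mut = mut_seq.replace(" ", "")
    if len(ref) != len(mut):
        raise ValueError("sequences differ in length; not a simple substitution")

    # Collect every index where the two sequences disagree.
    diffs = [i for i, (a, b) in enumerate(zip(ref, mut)) if a != b]
    if len(diffs) != 1:
        raise ValueError(f"expected exactly one differing residue, found {len(diffs)}")

    i = diffs[0]
    return f"{ref[i]}{i + 1}{mut[i]}"


# Toy sequences for illustration only (not taken from the test sets).
print(describe_substitution("MKTAYIAKQRP", "MKTAYIAKQRS"))  # P11S
```

Combining this descriptor with an accession would still require mapping the fragment back to its source protein and adding the fragment's offset to the position.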

yang-arina commented 10 months ago

Furthermore, I checked the variant counts, and they seem not to correspond entirely between the description of test sets in paper and the downloaded TSV files. Only the variant counts in test_set1.tsv and test_set3.tsv match up. Additionally, is there any relationship between the numbers of variants in the validation set and the test set?
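For anyone who wants to repeat this check, the per-file counts can be reproduced with a short script. This is a sketch that assumes the TSV files have no header row and one variant per non-empty line; the `test_set*.tsv` pattern follows the file names mentioned above, and `count_variants` and `summarize` are hypothetical helper names.

```python
import csv
from pathlib import Path


def count_variants(tsv_path):
    """Count non-empty rows in a tab-separated file (assumes no header row)."""
    with open(tsv_path, newline="") as handle:
        return sum(1 for row in csv.reader(handle, delimiter="\t") if row)


def summarize(directory, pattern="test_set*.tsv"):
    """Map each matching file name in `directory` to its variant count."""
    return {
        path.name: count_variants(path)
        for path in sorted(Path(directory).glob(pattern))
    }


# Usage (the directory name is an assumption about where the zip was unpacked):
# for name, n in summarize("test_data").items():
#     print(name, n)
```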

My data summary table is as follows: [image: data summary table]

tianqitheodorejiang commented 10 months ago

> However, I am eager to know whether the mutations include an identifier similar to 'Q96NU1_P10S,' encompassing the UniProt canonical accession, reference residue, mutation position, and mutation residue information. This unique ID would be instrumental for our cross-referencing efforts. Your assistance in providing this information would be greatly appreciated! Thank you once again for your invaluable support.

Hi! That is a really good question; an accession would indeed be really helpful, though I think the source of these datasets also unfortunately did not provide any accession IDs. I really apologize for the inconvenience on this end. I believe it definitely should be possible to backtrack using the reference sequence, though that definitely is by no means simple. I can spend some time looking into this and will let you know if I do end up finding some accession ID information.

-Theo

tianqitheodorejiang commented 10 months ago

> Furthermore, I checked the variant counts, and they seem not to correspond entirely between the description of test sets in paper and the downloaded TSV files. Only the variant counts in test_set1.tsv and test_set3.tsv match up. Additionally, is there any relationship between the numbers of variants in the validation set and the test set?

That is a great catch! Upon looking into this more, I believe the discrepancy between the numbers in the two sets is actually a versioning mistake; it seems to me that the numbers in the paper don't reflect the most updated values. As for the actual test set, this should be the most updated version and should be the data we used to obtain the results in the ROC/PRG curves, so you should be able to go ahead and use it; the discrepancy is still due to the versioning error in the numbers. So sorry for the confusion and please let me know if you have any more questions!

-Theo

yang-arina commented 10 months ago

> > However, I am eager to know whether the mutations include an identifier similar to 'Q96NU1_P10S,' encompassing the UniProt canonical accession, reference residue, mutation position, and mutation residue information. This unique ID would be instrumental for our cross-referencing efforts. Your assistance in providing this information would be greatly appreciated! Thank you once again for your invaluable support.
>
> Hi! That is a really good question; an accession would indeed be really helpful, though I think the source of these datasets also unfortunately did not provide any accession IDs. I really apologize for the inconvenience on this end. I believe it definitely should be possible to backtrack using the reference sequence, though that definitely is by no means simple. I can spend some time looking into this and will let you know if I do end up finding some accession ID information.
>
> -Theo

Thank you for your reply! I have considered using BLAST to obtain the corresponding accession for the reference sequences. However, my concern is that, due to the sequence length limitation of 1024, the reference sequences provided in the file should ideally not exceed 512 in length. This is also a challenge, as per my understanding.
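As a lighter-weight alternative to BLAST, an exact substring search of each reference fragment against a locally downloaded UniProt FASTA file may be enough when the fragment was cut from an unmodified canonical sequence. The sketch below is an assumption-laden illustration, not the authors' method: it assumes UniProt-style headers of the form `>sp|ACCESSION|NAME ...`, exact matches only, and hypothetical helper names (`read_fasta`, `accession_from_header`, `find_fragment`). A short fragment can match several proteins or isoforms, so every hit is returned together with its 0-based offset.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_from_header(header):
    """Extract the accession from a 'db|ACCESSION|NAME ...' style header."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header.split()[0]


def find_fragment(fragment, fasta_lines):
    """Return [(accession, offset), ...] for proteins containing the fragment."""
    fragment = fragment.replace(" ", "")  # assumption: spaces are separators only
    hits = []
    for header, sequence in read_fasta(fasta_lines):
        offset = sequence.find(fragment)  # first occurrence, 0-based; -1 if absent
        if offset != -1:
            hits.append((accession_from_header(header), offset))
    return hits


# Toy records with made-up accessions, for illustration only.
toy_fasta = [
    ">sp|X00001|TOY1_HUMAN toy protein 1",
    "MKTAYIAKQR",
    "PLLGG",
    ">sp|X00002|TOY2_HUMAN toy protein 2",
    "GGGGG",
]
print(find_fragment("QRPLL", toy_fasta))  # [('X00001', 8)]
```

With a hit in hand, the protein-level position would be the offset plus the 1-based position of the mutation within the fragment, which sidesteps the fragment length limit as long as the match is unique.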

yang-arina commented 10 months ago

> > Furthermore, I checked the variant counts, and they seem not to correspond entirely between the description of test sets in paper and the downloaded TSV files. Only the variant counts in test_set1.tsv and test_set3.tsv match up. Additionally, is there any relationship between the numbers of variants in the validation set and the test set?
>
> That is a great catch! Upon looking into this more, I believe the discrepancy between the numbers in the two sets is actually a versioning mistake; it seems to me that the numbers in the paper don't reflect the most updated values. As for the actual test set, this should be the most updated version and should be the data we used to obtain the results in the ROC/PRG curves, so you should be able to go ahead and use it; the discrepancy is still due to the versioning error in the numbers. So sorry for the confusion and please let me know if you have any more questions!
>
> -Theo

Thank you once again for your patient guidance and explanations. In terms of the quantity, I currently have no further questions.

davinaa-byte commented 1 month ago

Hi! I've been trying to access the file "http://www.openbioinformatics.org/mutformer/hg19_MutFormer.zip" and "basic_example.zip". I tried to open the links on different PCs, but they aren't working on any of them. Can you help me with that?

fangli80 commented 1 month ago

Hello, I am able to use my browser or use the command wget http://www.openbioinformatics.org/mutformer/hg19_MutFormer.zip to download the file.
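If neither a browser nor wget is convenient, the same download can be attempted from Python's standard library, which also surfaces the underlying error (timeout, DNS failure, HTTP status) if the request fails. This is a minimal sketch; `download` is a hypothetical helper and the 60-second timeout is an arbitrary choice.

```python
import shutil
import urllib.request


def download(url, dest, timeout=60):
    """Stream the resource at `url` into the local file `dest`."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest, "wb") as out:
        # Copy in chunks so a large archive is never held in memory at once.
        shutil.copyfileobj(response, out)
    return dest


# Usage (requires network access):
# download("http://www.openbioinformatics.org/mutformer/hg19_MutFormer.zip",
#          "hg19_MutFormer.zip")
```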

fangli80 commented 1 month ago

Could you please try again and let me know if you still have the problem?

davinaa-byte commented 1 month ago

Yes, I still can't access the files.

davinaa-byte commented 1 month ago

Could it be because of my region? Can you provide the files here, or perhaps email them to me?

fangli80 commented 1 month ago

The file is too large and cannot be attached here. Please let me know if you can download from here: http://144.34.239.101/0c10d412d553db42102cec1fb51e668a/hg19_MutFormer.zip

davinaa-byte commented 1 month ago

this link isn't loading either!

fangli80 commented 1 month ago

The server (http://144.34.239.101/) has been working for years and has been accessed by people from all over the world.

Please try to ping 144.34.239.101 and see what you get

davinaa-byte commented 1 month ago

yes, PPAR gene database is there.

fangli80 commented 1 month ago

Does your browser or institute block HTTP downloads because they are not HTTPS?

davinaa-byte commented 1 month ago

no, there's nothing like that.

fangli80 commented 1 month ago

try to open this link http://144.34.239.101/0c10d412d553db42102cec1fb51e668a

and then click the file to download

davinaa-byte commented 1 month ago

Yes, it's working now. I need the link for the test zip too, please.

fangli80 commented 1 month ago

It is uploaded to the same place. Please open http://144.34.239.101/0c10d412d553db42102cec1fb51e668a/ and download it.

davinaa-byte commented 1 month ago

Thank you so much for your help, sir. I really appreciate it :)

davinaa-byte commented 1 month ago

Sir, can I get the links to the hg38_MUTFORMER file and http://www.openbioinformatics.org/mutformer/basic_example.zip?

davinaa-byte commented 1 month ago

I need the file for analysis because it appears that the data I'm using is from hg38. Can you please provide the link?

fangli80 commented 1 month ago

Hello, sorry for the late reply. These files have been uploaded to the same place. Please open http://144.34.239.101/0c10d412d553db42102cec1fb51e668a/ and download them.
