Open yang-arina opened 10 months ago
Hi! You are correct that the testing datasets mentioned in the paper are not publicly available. This is because I think some of the datasets may require licenses and prohibit redistribution (though I am not positive about this). I can however point you to exactly where to download each dataset (note that during MutFormer's experiments, we filtered all testing datasets by removing data points present in any pretraining or finetuning data; if you are interested in the filtered dataset versions, I can take a closer look at the licensing situation again and if possible provide you with these sets):
Hope this helps!
-Theo
Thank you for your patient reply! I am indeed very interested in your filtered dataset versions. I would be extremely grateful if it were convenient for you to provide me with these filtered datasets!
Hello! I sincerely apologize for reaching out again. I would like to inquire if you could provide me with the filtered dataset regarding the above-mentioned issue. This is crucial for our research, and I would greatly appreciate it if you could respond at your earliest convenience. Thank you very much.
Hi! So sorry for the late reply; it does appear that since all the data sources are open-source, it should be okay. Here are the filtered datasets (please let me know if you have any more questions!): test_data.zip
Hi! I sincerely appreciate your patience and professionalism. The data you provided will be immensely helpful to us! I have downloaded and reviewed the test dataset, and I noticed that there are 5 columns. According to my understanding, these columns should be labeled as follows: label, reference sequence, mutant sequence, external data, and mutation position. This aligns perfectly with the format required for MutFormer input files. However, I am eager to know whether the mutations include an identifier similar to 'Q96NU1_P10S,' encompassing the UniProt canonical accession, reference residue, mutation position, and mutation residue information. This unique ID would be instrumental for our cross-referencing efforts. Your assistance in providing this information would be greatly appreciated! Thank you once again for your invaluable support.
Furthermore, I checked the variant counts, and the numbers in the paper's description of the test sets do not entirely match the downloaded TSV files. Only the variant counts in test_set1.tsv and test_set3.tsv match up. Additionally, is there any relationship between the numbers of variants in the validation set and the test set?
My data summary table is as follows:
However, I am eager to know whether the mutations include an identifier similar to 'Q96NU1_P10S,' encompassing the UniProt canonical accession, reference residue, mutation position, and mutation residue information. This unique ID would be instrumental for our cross-referencing efforts. Your assistance in providing this information would be greatly appreciated! Thank you once again for your invaluable support.
Hi! That is a really good question; an accession would indeed be really helpful, though I think the source of these datasets also unfortunately did not provide any accession IDs. I really apologize for the inconvenience on this end. I believe it definitely should be possible to backtrack using the reference sequence, though that definitely is by no means simple. I can spend some time looking into this and will let you know if I do end up finding some accession ID information.
-Theo
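For anyone following along, the residue part of an identifier like 'P10S' can be recovered from the two sequence columns alone, even without an accession. The snippet below is only a minimal sketch: it assumes the column order described above (label, reference sequence, mutant sequence, ...), no header row, and exactly one substituted residue per row. The function names are hypothetical and not part of MutFormer. The reported position is relative to the sequence as stored in the file, which may be a window rather than the full protein.

```python
import csv
import io


def substitution_tag(ref_seq, mut_seq):
    """Return a tag such as 'P10S' for a single-residue substitution, else None."""
    # Drop any whitespace, in case residues are stored space-separated (assumption).
    ref = "".join(ref_seq.split())
    mut = "".join(mut_seq.split())
    if len(ref) != len(mut):
        return None
    diffs = [i for i, (r, m) in enumerate(zip(ref, mut)) if r != m]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    # Positions are reported 1-based, relative to the stored sequence.
    return f"{ref[i]}{i + 1}{mut[i]}"


def tag_rows(tsv_text):
    """Return (label, tag) pairs for each row of tab-separated text."""
    rows = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        rows.append((row[0], substitution_tag(row[1], row[2])))
    return rows
```

Rows where the function returns None (length mismatch, or zero or several differences) would need to be inspected by hand.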
Furthermore, I checked the variant counts, and the numbers in the paper's description of the test sets do not entirely match the downloaded TSV files. Only the variant counts in test_set1.tsv and test_set3.tsv match up. Additionally, is there any relationship between the numbers of variants in the validation set and the test set?
That is a great catch! Upon looking into this more, I believe the discrepancy between the numbers in the two sets is actually a versioning mistake; it seems to me that the numbers in the paper don't reflect the most updated values. As for the actual test set, this should be the most updated version and should be the data we used to obtain the results in the ROC/PRG curves, so you should be able to go ahead and use it; the discrepancy is still due to the versioning error in the numbers. So sorry for the confusion and please let me know if you have any more questions!
-Theo
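The counts being compared here can be reproduced with a few lines of Python. This is a rough sketch, assuming each TSV has no header row and carries the label in its first column (as described earlier in the thread); `count_variants` is a hypothetical helper name.

```python
import csv
from collections import Counter


def count_variants(tsv_path):
    """Return (total_rows, Counter of first-column labels) for one TSV file."""
    labels = Counter()
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row:  # ignore completely empty lines
                labels[row[0]] += 1
    return sum(labels.values()), labels
```

Calling `count_variants("test_set1.tsv")` for each downloaded file gives the total and the per-label breakdown to compare against the table in the paper.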
Thank you for your reply! I have considered using BLAST to obtain the corresponding accession for the reference sequences. However, my concern is that, due to the sequence length limitation of 1024, the reference sequences provided in the file should ideally not exceed 512 in length. This is also a challenge, as per my understanding.
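As a lighter-weight alternative to BLAST (this is plain exact substring matching, not an alignment), a stored reference sequence can be searched for inside a locally downloaded UniProt FASTA file. The sketch below assumes UniProt-style headers of the form `>sp|ACCESSION|ENTRY_NAME description`. The function names are hypothetical, and a window that differs from the canonical sequence by even one residue will not be found.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header):
    """Extract the accession from 'sp|ACCESSION|ENTRY_NAME ...', else return the header."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header


def locate_window(window, fasta_text):
    """Return [(accession, 0-based offset)] for proteins containing the window exactly."""
    hits = []
    for header, seq in parse_fasta(fasta_text):
        offset = seq.find(window)  # first occurrence only
        if offset != -1:
            hits.append((accession_of(header), offset))
    return hits
```

If a window matches at offset `k` and the mutation sits at 1-based position `p` inside the window, the full-length position would be `k + p`. Several hits for the same window mean the accession stays ambiguous.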
Thank you once again for your patient guidance and explanations. In terms of the quantity, I currently have no further questions.
Hi! I've been trying to access the file "http://www.openbioinformatics.org/mutformer/hg19_MutFormer.zip" and basic_example.zip. I tried opening the link on different PCs, but it does not work on any of them. Can you help me with that?
Hello,
I am able to download the file with my browser or with the command
wget http://www.openbioinformatics.org/mutformer/hg19_MutFormer.zip
Could you please try again and let me know if you still have the problem?
Yes, I still can't access the files.
Could it be because of my region? Can you provide the files here, or perhaps email them to me?
The file is too large and cannot be attached here. Please let me know if you can download from here: http://144.34.239.101/0c10d412d553db42102cec1fb51e668a/hg19_MutFormer.zip
This link isn't loading either!
The server (http://144.34.239.101/) has been working for years and is accessed by people from all over the world.
Please try to ping 144.34.239.101 and see what you get.
Yes, PPAR gene database is there.
Does your browser or institute block HTTP downloads because they are not HTTPS?
No, there's nothing like that.
Try opening this link: http://144.34.239.101/0c10d412d553db42102cec1fb51e668a
and then click the file to download it.
Yes, it's working now. I need the link for the test zip too, please.
It is uploaded to the same place. Please open http://144.34.239.101/0c10d412d553db42102cec1fb51e668a/ and download it.
Thank you so much for your help, sir. I really appreciate it :)
Sir, can I get the links to the hg38_MUTFORMER file and http://www.openbioinformatics.org/mutformer/basic_example.zip?
I need the file for analysis because it appears that the data I'm using is from hg38. Can you please provide the link?
Hello, sorry for the late reply. These files have been uploaded to the same place. Please open http://144.34.239.101/0c10d412d553db42102cec1fb51e668a/ and download them.
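For scripted downloads, the "try the main link, then the mirror" procedure from this exchange could be sketched roughly as follows. `download_first_available` is a hypothetical helper, not part of MutFormer. It relies only on the standard library's `urllib.request.urlretrieve`, and it cannot work around a network that blocks plain HTTP.

```python
import urllib.request


def download_first_available(urls, dest, fetch=urllib.request.urlretrieve):
    """Try each URL in order; return the first one that downloads to dest."""
    errors = []
    for url in urls:
        try:
            fetch(url, dest)
            return url
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all downloads failed:\n" + "\n".join(errors))
```

A possible call, using the two hg19 links quoted in this thread:

```python
download_first_available(
    [
        "http://www.openbioinformatics.org/mutformer/hg19_MutFormer.zip",
        "http://144.34.239.101/0c10d412d553db42102cec1fb51e668a/hg19_MutFormer.zip",
    ],
    "hg19_MutFormer.zip",
)
```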
Hello! The paper (https://www.sciencedirect.com/science/article/pii/S2666675823001157) states: "To assess the performance of MutFormer against existing methods of deleteriousness prediction, a total of five [testing datasets] were used." They are shown in "Table 2. Details for each testing dataset". However, I could not find these five test sets in the supplementary materials of the article or on GitHub. I would be very grateful if you could provide them to me!