Open jacksonh1 opened 2 years ago
Hi Jackson,
Thanks, I'm glad you like the work!
We applied these filters at different times:
Hope this helps.
Alex
From: jacksonh1 Sent: Friday, November 11, 2022 5:20 PM To: alexxijielu/reverse_homology Subject: [alexxijielu/reverse_homology] Question about human homologous IDR dataset (Issue #2)
Hello Alex,
I quite like your paper, and the work that you've done is very exciting! I think that your dataset of homologous IDRs in the human proteome could be a very valuable resource for me in some of the research I'm doing. I've been looking at the most recent dataset deposited to Zenodo (V6 - 6311384, https://zenodo.org/record/6311384) (file - human_idr_homologues.zip) and had just a few questions about it. Are these the homolog IDRs after applying the evolutionary distance-based clustering/filtering method described in the methods to remove "redundant" sequences? The methods also state that sequences with "X" characters or that were 3x longer/shorter than the human protein were removed. However, I've found cases in the dataset where there are sequences with 'X' characters or with empty sequences present (see example below). It is easy for me to filter those out on my end, but it did make me wonder if the clustering/redundancy filter has been applied to these sequences or not?
Example from HUMAN13574_301to347.fasta:
>ACAM105111
-Q-------G--------------------------------------------------
>CHAP605088
-K----------------------------------------------------------
>SYNY302936
-K-------K--------------------------------------------------
>PROMS00447
>PROM400393
-C----------------------------------------------------------
>PROM200471
>PROM000417
>PROM302117
>PROM900417
>PROMM00236
>PROM500458
>PROM100444
-E-------EVST-----------------------------------------------
>PROMT00426
-E-------EV-------------------------------------------------
>PROMA00393
>PROMP00444
>CYAGP02842
>SYNE702106
>SYNP602035
>SYNP302362
-Q-------IAND-----------------------------------------------
Thanks so much for the help, Jackson
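For anyone who wants to do the same clean-up on their end, here is a minimal sketch of the kind of filter Jackson describes: it drops records whose sequence is empty (or gaps only) or contains an 'X'. The function names `parse_fasta`, `keep_record`, and `filter_records` are hypothetical and not part of the reverse_homology code, and the sketch does not attempt the 3x length filter or the redundancy clustering from the paper's methods.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; a header with no sequence lines yields ''."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, even if it had no sequence.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_record(seq: str) -> bool:
    """Keep a sequence only if it has residues left after removing '-' and no 'X'."""
    residues = seq.replace("-", "")
    return bool(residues) and "X" not in residues.upper()


def filter_records(text: str) -> List[Tuple[str, str]]:
    """Return the (header, sequence) pairs that pass keep_record."""
    return [(h, s) for h, s in parse_fasta(text) if keep_record(s)]
```

Calling `filter_records(open(path).read())` on one of the `.fasta` files would return only the records that have at least one residue and no 'X'; treating gap-only sequences as empty is an assumption of this sketch.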
Hi Alex,
Thank you so much for the quick response! I'm not sure if I will be able to see responses from your collaborator if they are cc'd via email response to the github issue, so I apologize if they have also replied and I haven't seen it. Anyway, this makes sense to me, thanks for the help. I had assumed the redundancy filter was applied to the IDRs, but it probably makes more sense that you used the full-length proteins for that.
Just to double-check that I'm using the correct dataset, is the latest version on Zenodo (V6) (6311384) the right version to use?
I wasn't sure because I noticed that some of the V6 entries are aligned (have gaps and have the same number of characters) and some are not, whereas the V1 entries all seem to be aligned.
For example, from HUMAN00013_668to702.fasta:
>OTOGA09052 488 to 502
DRPGRGLGPSSLGAG
>CAVPO00930 600 to 630
DRSRGVLPPTTLLQLQTSEPSRSMGTRTPPE
>SHEEP03855 719 to 753
DRSGGTLGPAALLQTQVTEPPRSVLWGVGTGAPPE
>AILME01473 600 to 634
DSSGSTLELAALLQLQAAEPPSLVPWGVEPGTPPE
and from HUMAN00068_58to122.fasta:
>LEPOC09969
--------------------------------------------------------------TLPEL---------------IPH
>ANATE20091
-------------------------------------------------------------------------------------
>TAKRU09568
------------------------------------------------------------PPVTPDL---------------FPE
>TETNG14694
----------------------------------------------------------FQPPVTPDL---------------FPE
Thanks again for the help! Jackson
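The check described in this message (entries count as "aligned" when they contain gaps and all have the same number of characters) can be sketched roughly as follows. `looks_aligned` is a hypothetical helper, and using '-' as the gap character is an assumption based on the examples shown in this thread.

```python
from typing import Sequence


def looks_aligned(seqs: Sequence[str], gap: str = "-") -> bool:
    """Heuristic: all sequences share one length and at least one contains a gap."""
    if not seqs:
        return False
    same_length = len({len(s) for s in seqs}) == 1
    has_gaps = any(gap in s for s in seqs)
    return same_length and has_gaps


# Shortened stand-ins for the two kinds of entries quoted in the thread.
unaligned = ["DRPGRGLGPSSLGAG", "DRSRGVLPPTTLLQLQTSEPSRSMGTRTPPE"]
aligned = ["--TLPEL--IPH", "------------", "PPVTPDL--FPE"]
```

Running this over every file in the archive would give a quick list of which entries are stored as alignments and which as raw sequences. It is only a heuristic: a set of ungapped sequences that happen to have equal length is reported as not aligned.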
Hi Jackson,
I'll let you know once Iva gets back to me on this! This does seem to be a weird inconsistency, and I will dig a little deeper into it. The likely explanation is that there are some inconsistencies between OMA and UniProt, and we had to remap some of these IDRs based on sequence homology. In any case, it shouldn't impact the reverse homology model itself (since we strip out all alignment tokens as preprocessing), but it could impact other applications...
Alex
In reply to jacksonh1's message of Tuesday, November 15, 2022, 4:54 PM (Subject: Re: [alexxijielu/reverse_homology] Question about human homologous IDR dataset (Issue #2)).
Hi Alex,
Oh okay, awesome, thanks! I've also been stripping out the alignment tokens, so it shouldn't be an issue for me either, but I thought I'd double-check to make sure I've got the right dataset.
Thanks again! Jackson
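Stripping the alignment tokens, as both messages mention, can be sketched like this. It is not the preprocessing code from the reverse_homology repository; `strip_gaps` is a hypothetical helper, and it assumes '-' is the only gap character, as in the examples quoted in this thread.

```python
def strip_gaps(seq: str, gap_chars: str = "-") -> str:
    """Remove every gap character from a sequence, leaving only residues."""
    # str.maketrans with a third argument maps those characters to None (deletion).
    return seq.translate(str.maketrans("", "", gap_chars))


# Shortened stand-ins for aligned entries like those in HUMAN00068_58to122.fasta.
ungapped = [strip_gaps(s) for s in ["--FQPPVTPDL---FPE", "-----------------"]]
```

After stripping, aligned and unaligned entries look the same to downstream code, which fits the point above that the inconsistency shouldn't affect the model. A gap-only entry becomes an empty string, so it may still need to be filtered out separately.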