Question about human homologous IDR dataset

jacksonh1 commented 2 years ago

Hello Alex,

I quite like your paper, and the work that you've done is very exciting! I think that your dataset of homologous IDRs in the human proteome could be a very valuable resource for me in some of the research I'm doing. I've been looking at the most recent dataset deposited to Zenodo (V6 - 6311384) (file - human_idr_homologues.zip) and had just a few questions about it. Are these the homologous IDRs after applying the evolutionary distance-based clustering/filtering method described in the methods to remove "redundant" sequences? The methods also state that sequences with "X" characters or that were 3x longer/shorter than the human protein were removed. However, I've found cases in the dataset where sequences contain "X" characters or are empty (see example below). It is easy for me to filter those out on my end, but it did make me wonder whether the clustering/redundancy filter has been applied to these sequences or not.

example from HUMAN13574_301to347.fasta:

>ACAM105111
-Q-------G--------------------------------------------------
>CHAP605088
-K----------------------------------------------------------
>SYNY302936
-K-------K--------------------------------------------------
>PROMS00447
------------------------------------------------------------
>PROM400393
-C----------------------------------------------------------
>PROM200471
------------------------------------------------------------
>PROM000417
------------------------------------------------------------
>PROM302117
------------------------------------------------------------
>PROM900417
------------------------------------------------------------
>PROMM00236
------------------------------------------------------------
>PROM500458
------------------------------------------------------------
>PROM100444
-E-------EVST-----------------------------------------------
>PROMT00426
-E-------EV-------------------------------------------------
>PROMA00393
------------------------------------------------------------
>PROMP00444
------------------------------------------------------------
>CYAGP02842
------------------------------------------------------------
>SYNE702106
------------------------------------------------------------
>SYNP602035
------------------------------------------------------------
>SYNP302362
-Q-------IAND-----------------------------------------------

Thanks so much for the help, Jackson

alexxijielu commented 2 years ago

Hi Jackson,

Thanks, I'm glad you like the work!

We applied these filters at different times:

1. The raw data should have the homologues filtered by redundancy. I believe this is done at the protein level, not the IDR level (CCing my collaborator Iva to double-check this), so you might still get some IDRs with all gaps depending on the alignment.
2. Afterwards, we filter IDR sequences (removing those with "X" characters or under 5 AAs) in the data loader itself.
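
For anyone applying the second filter on their own end, here is a minimal sketch of what it might look like. It is not the code from the reverse_homology data loader. It assumes the length cutoff is applied after removing `-` gap characters, and the names `parse_fasta` and `keep_idr` and the two `DEMO` records are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_idr(seq, min_len=5):
    """Return True if the ungapped sequence has >= min_len residues and no 'X'.

    Assumption: length is counted after gaps are removed.
    """
    residues = seq.replace("-", "")
    return len(residues) >= min_len and "X" not in residues.upper()


# Shortened rows in the style of the thread, plus two hypothetical DEMO records.
example = """>ACAM105111
-Q-------G----
>PROMS00447
--------------
>DEMO000001
-MKXLSTPEE----
>DEMO000002
-MKTLSTPEE----
"""

kept = [(h, s.replace("-", "")) for h, s in parse_fasta(example) if keep_idr(s)]
print(kept)
```

Only the last record survives here: the first has two residues, the second is all gaps, and the third contains an "X".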

Hope this helps.

Alex

jacksonh1 commented 2 years ago

Hi Alex,

Thank you so much for the quick response! I'm not sure if I will be able to see responses from your collaborator if they are cc'd via email response to the github issue, so I apologize if they have also replied and I haven't seen it. Anyway, this makes sense to me, thanks for the help. I had assumed the redundancy filter was applied to the IDRs, but it probably makes more sense that you used the full-length proteins for that.

Just to double-check that I'm using the correct dataset, is the latest version on Zenodo (V6) (6311384) the right version to use? I wasn't sure because I noticed that some of the V6 entries are aligned (have gaps and have the same number of characters) and some are not, whereas the V1 entries all seem to be aligned. For example, from HUMAN00013_668to702.fasta:

>OTOGA09052 488 to 502
DRPGRGLGPSSLGAG
>CAVPO00930 600 to 630
DRSRGVLPPTTLLQLQTSEPSRSMGTRTPPE
>SHEEP03855 719 to 753
DRSGGTLGPAALLQTQVTEPPRSVLWGVGTGAPPE
>AILME01473 600 to 634
DSSGSTLELAALLQLQAAEPPSLVPWGVEPGTPPE

and from HUMAN00068_58to122.fasta:

>LEPOC09969
--------------------------------------------------------------TLPEL---------------IPH
>ANATE20091
-------------------------------------------------------------------------------------
>TAKRU09568
------------------------------------------------------------PPVTPDL---------------FPE
>TETNG14694
----------------------------------------------------------FQPPVTPDL---------------FPE
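
The aligned/unaligned distinction described above (same number of characters, with gaps) can be checked per file with a rough heuristic like the one below. This is only a sketch: `is_aligned` is a hypothetical helper, and the `aligned` example rows are abbreviated stand-ins for the HUMAN00068 rows, not the real data.

```python
def is_aligned(seqs):
    """Heuristic: every sequence has the same length and at least one has a gap."""
    lengths = {len(s) for s in seqs}
    return len(lengths) == 1 and any("-" in s for s in seqs)


# Two rows from HUMAN00013_668to702.fasta as quoted above (different lengths, no gaps).
unaligned = ["DRPGRGLGPSSLGAG", "DRSRGVLPPTTLLQLQTSEPSRSMGTRTPPE"]

# Abbreviated, illustrative rows in the style of HUMAN00068_58to122.fasta.
aligned = ["---TLPEL--IPH", "-------------", "-PVTPDL---FPE"]

print(is_aligned(unaligned), is_aligned(aligned))
```

A file of equal-length sequences with no gaps at all would be reported as unaligned by this check, so treat the result as a hint rather than a guarantee.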

Thanks again for the help! Jackson

alexxijielu commented 2 years ago

Hi Jackson,

I'll let you know once Iva gets back to me on this! This does seem to be a weird inconsistency, and I will dig a little deeper into it. The likely explanation is that there are some inconsistencies between OMA and UniProt, so we had to remap some of these IDRs based on sequence homology. In any case, it shouldn't impact the reverse homology model itself (since we strip out all alignment tokens as preprocessing), but it could impact other applications...
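
As a rough illustration of the gap-stripping step (not the actual preprocessing code from the repository), the sketch below removes alignment tokens from a sequence. The helper name `strip_gaps` is hypothetical, and treating `.` as a gap character alongside `-` is an assumption; the files shown in this thread only use `-`.

```python
def strip_gaps(seq, gap_chars="-."):
    """Remove alignment gap characters, leaving only residues."""
    # str.maketrans with a third argument maps those characters to None (deletion).
    return seq.translate(str.maketrans("", "", gap_chars))


# Shortened version of the TETNG14694 row quoted in this thread.
print(strip_gaps("----FQPPVTPDL-----FPE"))
```

After this step, aligned and unaligned files yield the same kind of input, which is why the mixed formats should not matter for the model.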

Alex

jacksonh1 commented 2 years ago

Hi Alex,

Oh okay, awesome, thanks! I've also been stripping out the alignment tokens, so it shouldn't be an issue for me either, but I thought I'd double-check to make sure I've got the right dataset.

Thanks again! Jackson
