ebi-pf-team / interproscan

Genome-scale protein function classification
Apache License 2.0
303 stars 67 forks source link

incorrect number of proteins in downloaded database #266

Closed jalan01 closed 2 years ago

jalan01 commented 2 years ago

I am trying to download the following data https://www.ebi.ac.uk:443/interpro/api/protein/UniProt/entry/InterPro/IPR008160/taxonomy/uniprot/2759/

It should contain ~ 80k proteins. I get more than 300k proteins when downloading via the python3 script!!

matthiasblum commented 2 years ago

Hi @jalan01,

I could not reproduce the issue.

I downloaded the script from https://www.ebi.ac.uk/interpro/result/download/#/protein/UniProt/entry/InterPro/IPR008160/taxonomy/uniprot/2759/|tsv and I got about 80k proteins. The script and the output file are attached to this message.

euk-proteins-for-ipr008160.zip

jalan01 commented 2 years ago

Dear Matthias

The script and the fasta file it produced (link) is attached. I have tried json, tsv and fasta and each time the number of sequence is different (250-300k). I was wondering if you could also share the fasta (80k) file for the eukaryota annotation.

https://myfiles.uni-bayreuth.de/filr/public-link/file-download/ff808082808343910180acfd33282af6/66843/-7566420924527750390/eukaryote-protein-matching-IPR008160.fasta

I am running the script on Mac running OS 12.12.1.

best regards,

Abhishek Jalan

On 5/9/22, Matthias Blum @.***> wrote:

Hi @jalan01,

I could not reproduce the issue.

I downloaded the script from https://www.ebi.ac.uk/interpro/result/download/#/protein/UniProt/entry/InterPro/IPR008160/taxonomy/uniprot/2759/|tsv and I got abound 80k proteins. The script and the output file are attached to this message.

euk-proteins-for-ipr008160.zip

-- Reply to this email directly or view it on GitHub: https://github.com/ebi-pf-team/interproscan/issues/266#issuecomment-1121219092 You are receiving this because you were mentioned.

Message ID: @.***>

matthiasblum commented 2 years ago

Hi @jalan01,

I found the issue: when downloading in the FASTA format, the script writes the same sequence for each hit of IPR008160.

For instance, IPR008160, has two hits on A0A060WQA3 (OTOL1_ONCMY), so the sequence of A0A060WQA3 is written twice:

$ grep -A 7 A0A060WQA3 eukaryote-protein-matching-IPR008160.fasta 
>A0A060WQA3|IPR008160|193...281|Otolin-1
MPSLRLLAILTTLLAVVLMATQSSATRTTRRPKPQNTKKPPRGGGTGGGGGGGDQPARLGFRQTTTTMSPSSSLGTDETT
EDTMTDAYSLSPTDSTTYAGDAYPTEFHTDSMALPGAGMGNYTLDYSHCYLNVCECCPPEKGPVGPPGERGPPGPGAERA
PAVPSETLALMLGTDIYTHTFKIPILLSFYLIGDKGDQGDTGMPGAPGILGKEGQKGDLGPKGEKGETGLPGLKGDLGER
GKPGWNGTQGEKGDLGKIGPAGPSGLTGPMGQNGQKGEMGECPTGEKGEKGEAGLPGPPGPRGSVGPPGVNGSNGLPGPV
GLRGQLGSPGGKGEAGGRGPPGLRGMPGPKGEKGPKGPRGVRGPKGPQGETAEQIRSAFSVGLFPSKSFPPPGLPVKFDK
VLYNEEEHWDPMLSKFNCTHPGVYVFSYHITVRNRPLRAALVINGVKKLRTRDSLYGQDIDQASNLALLRLASGDQVWLE
TLRDWNGVYSSSEDDSTFTGFLLYADPKA
>A0A060WQA3|IPR008160|285...371|Otolin-1
MPSLRLLAILTTLLAVVLMATQSSATRTTRRPKPQNTKKPPRGGGTGGGGGGGDQPARLGFRQTTTTMSPSSSLGTDETT
EDTMTDAYSLSPTDSTTYAGDAYPTEFHTDSMALPGAGMGNYTLDYSHCYLNVCECCPPEKGPVGPPGERGPPGPGAERA
PAVPSETLALMLGTDIYTHTFKIPILLSFYLIGDKGDQGDTGMPGAPGILGKEGQKGDLGPKGEKGETGLPGLKGDLGER
GKPGWNGTQGEKGDLGKIGPAGPSGLTGPMGQNGQKGEMGECPTGEKGEKGEAGLPGPPGPRGSVGPPGVNGSNGLPGPV
GLRGQLGSPGGKGEAGGRGPPGLRGMPGPKGEKGPKGPRGVRGPKGPQGETAEQIRSAFSVGLFPSKSFPPPGLPVKFDK
VLYNEEEHWDPMLSKFNCTHPGVYVFSYHITVRNRPLRAALVINGVKKLRTRDSLYGQDIDQASNLALLRLASGDQVWLE
TLRDWNGVYSSSEDDSTFTGFLLYADPK

However, if you count the number of unique proteins in the file, you get a more sensible figure (~80k):

$ grep '^>' eukaryote-protein-matching-IPR008160.fasta | cut -d'|' -f1 | sort -u | wc -l
79938

I have joined an archive containing the modified script (writes each sequence once only) and the FASTA file.

eukaryote-protein-matching-IPR008160.zip

jalan01 commented 2 years ago

Hi @matthiasblum, Fantastic! It all makes sense now. Thank you for the files. The issue is closed.

jalan01 commented 2 years ago

@matthiasblum, btw, the 79938 sequences contain another 2977 duplicate sequences.

matthiasblum commented 2 years ago

@jalan01 You are correct.

If I check the number of protein accessions in results.fasta (from the ZIP archive in posted in my previous command):

$ grep -c '>' results.fasta
79938

We find 79938 proteins, but how many unique sequences?

$ awk '/^>/ {printf("\n");next; } { printf("%s",$0);}  END {printf("\n");}' results.fasta | awk 'NF' | sort -u | wc -l
76961

76961 unique sequences, which means indeed some sequences are duplicated. Let's pick a random duplicated sequence:

$ awk '/^>/ {printf("\n");next; } { printf("%s",$0);}  END {printf("\n");}' results.fasta | awk 'NF' | sort | uniq -d | shuf | head -1
MGRPAGSLAPLRLLLLLALGHAWTYREEPQDGDREVCSENKIATTRYPCLKPTGELTTCFRKKCCKGYKFVLGQCIPEDYDVCAEAPCEQQCTDNFGRVLCTCYPGYRYDRERHRNREKPYCLDIDECAASNGTLCSQICVNTMGSYRCECHEGYTREADGKTCTKGDKAGLPEKSENVAKSGTCCASCKEFHQIKQTVLQLKQKVALLPNNAADLSKQITGEKVLASNAYIPGPPGQPGQQGPPGAPGPKGSQGVPGSPGPPGQPGPRGSMGPMGPSPDISHIKQGRRGPVGPPGAPGRDGSKGERGAPGPKGIPGPPGSFDFLLLMMADIRNDIAELQERVFGHRTHSSTEEFPLPQEFTNYHDTVDFGSGEDYKPRAAPRDSRVQKAAHP

And let's find which proteins have this sequence:

$ awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' results.fasta | awk 'NF' | grep -B 1 "MGRPAGSLAPLRLLLLLALGHAWTYREEPQDGDREVCSENKIATTRYPCLKPTGELTTCFRKKCCKGYKFVLGQCIPEDYDVCAEAPCEQQCTDNFGRVLCTCYPGYRYDRERHRNREKPYCLDIDECAASNGTLCSQICVNTMGSYRCECHEGYTREADGKTCTKGDKAGLPEKSENVAKSGTCCASCKEFHQIKQTVLQLKQKVALLPNNAADLSKQITGEKVLASNAYIPGPPGQPGQQGPPGAPGPKGSQGVPGSPGPPGQPGPRGSMGPMGPSPDISHIKQGRRGPVGPPGAPGRDGSKGERGAPGPKGIPGPPGSFDFLLLMMADIRNDIAELQERVFGHRTHSSTEEFPLPQEFTNYHDTVDFGSGEDYKPRAAPRDSRVQKAAHP"
>A0A7K6YW23|IPR008160|EGF-like domain-containing protein
MGRPAGSLAPLRLLLLLALGHAWTYREEPQDGDREVCSENKIATTRYPCLKPTGELTTCFRKKCCKGYKFVLGQCIPEDYDVCAEAPCEQQCTDNFGRVLCTCYPGYRYDRERHRNREKPYCLDIDECAASNGTLCSQICVNTMGSYRCECHEGYTREADGKTCTKGDKAGLPEKSENVAKSGTCCASCKEFHQIKQTVLQLKQKVALLPNNAADLSKQITGEKVLASNAYIPGPPGQPGQQGPPGAPGPKGSQGVPGSPGPPGQPGPRGSMGPMGPSPDISHIKQGRRGPVGPPGAPGRDGSKGERGAPGPKGIPGPPGSFDFLLLMMADIRNDIAELQERVFGHRTHSSTEEFPLPQEFTNYHDTVDFGSGEDYKPRAAPRDSRVQKAAHP
--
>A0A7L3S200|IPR008160|EGF-like domain-containing protein
MGRPAGSLAPLRLLLLLALGHAWTYREEPQDGDREVCSENKIATTRYPCLKPTGELTTCFRKKCCKGYKFVLGQCIPEDYDVCAEAPCEQQCTDNFGRVLCTCYPGYRYDRERHRNREKPYCLDIDECAASNGTLCSQICVNTMGSYRCECHEGYTREADGKTCTKGDKAGLPEKSENVAKSGTCCASCKEFHQIKQTVLQLKQKVALLPNNAADLSKQITGEKVLASNAYIPGPPGQPGQQGPPGAPGPKGSQGVPGSPGPPGQPGPRGSMGPMGPSPDISHIKQGRRGPVGPPGAPGRDGSKGERGAPGPKGIPGPPGSFDFLLLMMADIRNDIAELQERVFGHRTHSSTEEFPLPQEFTNYHDTVDFGSGEDYKPRAAPRDSRVQKAAHP
--
>A0A7L3UNP8|IPR008160|EGF-like domain-containing protein
MGRPAGSLAPLRLLLLLALGHAWTYREEPQDGDREVCSENKIATTRYPCLKPTGELTTCFRKKCCKGYKFVLGQCIPEDYDVCAEAPCEQQCTDNFGRVLCTCYPGYRYDRERHRNREKPYCLDIDECAASNGTLCSQICVNTMGSYRCECHEGYTREADGKTCTKGDKAGLPEKSENVAKSGTCCASCKEFHQIKQTVLQLKQKVALLPNNAADLSKQITGEKVLASNAYIPGPPGQPGQQGPPGAPGPKGSQGVPGSPGPPGQPGPRGSMGPMGPSPDISHIKQGRRGPVGPPGAPGRDGSKGERGAPGPKGIPGPPGSFDFLLLMMADIRNDIAELQERVFGHRTHSSTEEFPLPQEFTNYHDTVDFGSGEDYKPRAAPRDSRVQKAAHP

If we check UniProtKB, we can see that A0A7K6YW23, A0A7L3S200, and A0A7L3UNP8 are all proteins from species of the Alcidae family and share the same sequence. :)