RitchieLab / LOKI

Improve Pfam download speed #10

Closed: van-truong closed this issue 2 months ago

van-truong commented 3 months ago

The network speed seems to be throttled at <10 Mbps, and this data source is huge. Pfam migrated to InterPro, which means we should:

- check that we have the right data URLs
- if we're pulling multiple files, see whether we can split the download up and pull them at the same time

The URLs we have say HTTPS and FTP:

- http://pfam.xfam.org/
- https://www.ebi.ac.uk/interpro/download/pfam/
- https://pfam-docs.readthedocs.io/en/latest/ftp-site.html
- https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/
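As a rough sketch of the "pull at the same time" idea, the three files named later in this thread could be fetched on separate threads. This is only an illustration, not the LOKI loader code: `fetch_one`, `download_all`, and the `pfam_downloads` directory are hypothetical names, and it assumes the HTTPS mirror serves the same `database_files/` paths as the FTP URLs. If the server throttles per client rather than per connection, this may not help at all.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen

# Assumption: the HTTPS mirror exposes the same paths as the FTP URLs in this thread.
BASE = "https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/"
FILES = ["pfamA.txt.gz", "pfamA_reg_full_significant.txt.gz", "pfamseq.txt.gz"]


def fetch_one(url, dest):
    """Stream one URL to disk in 1 MiB chunks so large files never sit in memory."""
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    return dest


def download_all(urls, dest_dir, fetch=fetch_one, max_workers=3):
    """Download all URLs concurrently and return the local paths in input order."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    targets = [dest_dir / url.rsplit("/", 1)[-1] for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first worker exception.
        return list(pool.map(fetch, urls, targets))
```

Usage would look like `download_all([BASE + name for name in FILES], "pfam_downloads")`. Because one file is far larger than the others, the total time would still be bounded by that single file.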

van-truong commented 3 months ago

These are the original FTP URLs we pull from in the legacy LOKI code. The links look stable and are exactly the same on the InterPro site.

All the files available on the Pfam FTP site, with estimated file sizes: https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/

- Pfam A txt file = 9.6 MB
- Pfam A significant regions txt file = 5.1 GB
- Pfam sequence txt file = 24 GB

- ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA.txt.gz
- ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA_reg_full_significant.txt.gz
- ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamseq.txt.gz

van-truong commented 3 months ago

There are much smaller SQL files with the same base filenames. Is it possible to substitute those for the txt files?

- Pfam A sql file = 1.1 KB
- Pfam A significant regions sql file = 1 KB
- Pfam sequence sql file = 1 KB

- https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA.sql.gz
- https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA_reg_full_significant.sql.gz
- https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamseq.sql.gz
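One quick way to check whether a dump of this kind carries data or only table definitions would be to scan it for `INSERT` statements. The sketch below assumes a MySQL-style dump in which rows appear as `INSERT INTO ...` lines; `dump_has_rows` is a hypothetical helper, not part of LOKI.

```python
import gzip
import re

# Assumption: data rows in the dump are written as INSERT INTO statements.
_INSERT = re.compile(r"^\s*INSERT\s+INTO\b", re.IGNORECASE)


def dump_has_rows(path):
    """Return True if the gzipped SQL dump contains at least one INSERT statement."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return any(_INSERT.match(line) for line in handle)
```

A file of about 1 KB that returns `False` here would most likely hold only `CREATE TABLE` definitions.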

van-truong commented 3 months ago

TBD: waiting on Marylyn's input to see if we will pull a different data file that's not 24 GB.

The download takes ~11 hrs.

But Li said in Brandon's timed log notes that Pfam takes less than 30 min to process and compact into the SQLite schema.
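For a sense of scale, the figures above can be turned into an average throughput. This is back-of-the-envelope arithmetic only: it assumes the ~11 hrs is dominated by the 24 GB pfamseq file and uses decimal units (1 GB = 10^9 bytes); the 100 Mbps figure in the second call is a hypothetical comparison point, not a measured speed.

```python
def effective_mbps(size_gb, hours):
    """Average throughput in megabits per second for a download of size_gb in hours."""
    bits = size_gb * 1e9 * 8
    return bits / (hours * 3600) / 1e6


def hours_needed(size_gb, mbps):
    """Hours required to transfer size_gb at a sustained rate of mbps."""
    return size_gb * 1e9 * 8 / (mbps * 1e6) / 3600


print(round(effective_mbps(24, 11), 1))   # 4.8
print(round(hours_needed(24, 100), 2))    # 0.53
```

The ~4.8 Mbps result is consistent with the <10 Mbps throttling reported at the top of the thread, and it shows the download, not the processing step, is where almost all the time goes.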

van-truong commented 3 months ago

Li said:

Those 3 SQL dump files seem to contain only the table structure rather than the actual data. I browsed the FTP site looking for another input option but wasn't successful. FYI, according to the loader code, the pfamseq file uses these columns: pfam accession number, pfam id, name, group, and description. Below is a 2-line example from pfamseq.txt:

A0A010PZJ8      A0A010PZJ8_9PEZI        1       4FD1CDFB3D9B202C        1bd30c447abb827881660b202f470352               COP9 signalosome complex subunit 2 {ECO:0000313|EMBL:EXF72907.1}        4       493     Colletotrichum fioriniae PJ7   Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Sordariomycetes; Hypocreomycetidae; Glomerellales; Glomerellaceae; Colletotrichum; Colletotrichum acutatum species complex.     0       MSDDEDFMQESDEEQYDFEYEEDDDEDSGDVGIENKYYNAKQLKLTDPEDAIAEFLGIPPLEEEKGEWGFKGLKQAIKLEFKLGQYAKATEHYAELLTYVKSAVTRNYSEKSINNMLDYIEKGSDSPKAVACVEKFYSLTLESFQSTNNERLWLKTNIKLAKLLLDRKDYNTVIKKLRDLHKACQKEDGSDDPSKGTYSMEIYALEIQMHAETKNNKQLKRLYQRALKVRSAVPHPKIMGIIRECGGKMHMSEENWAEAQTDFFESFRNYDEAGSLQRIQVLKYLLLTTMLVKSTINPFDSQETKPYKQDPRITAMTDLVDAYQRDDVHAYENVLQKNQDILADPFIAENIDEVTRNMRTKGVLKLIAPYTRMKLSWIAKQLKISEPEVQDILGFLIVDGKIQGKIDQQAGTLEIQSDADSDRTKALYELTQSVSTLYTTMFKEGEGFRSTEFPTDEQTMEMMGGGMTPRGGGRGQPRGVGRKGKGVVPSMWT      2023-09-29 07:13:44     NULL    1445577 NULL    NULL    0
A0A010PZK3      A0A010PZK3_9PEZI        1       F95471BD7D21C9C6        df96804c438684f33bedcc923bb6f636               Glycosyl hydrolase family 16 {ECO:0000313|EMBL:EXF72912.1}      4       512     Colletotrichum fioriniae PJ7   Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Sordariomycetes; Hypocreomycetidae; Glomerellales; Glomerellaceae; Colletotrichum; Colletotrichum acutatum species complex.     0       MLSQYTLSALAVLASLAQPALAQVSTKCNPMNTTCPADPAFGMDYNFNFNSTPSTDAWETTVGPVTYTSDNGAEFTISKQGDSPTIRSKFYFFWGRTEIHMRAAKGKGIVSSMMWLSDTLDEVDWEFLGIKNDALSNFFGKGVQDWHNGAEHPVTGSIQDDFHNYTCVWTKEKLEWWVDGNNVRTLLPKDANNSLAYPQTPMRLSLGIWAGGDPRMAAGTREWAGGDTDYAAGPYTMYVKSAQVTDYSSGKEYSFGDKTGSWESIKIAAGNSTVKEALLEEPSKSVSEKFNELSPTAKTAVYAGGVGVGCALIAFGLWYFIRQRRRGANEASLAAKRAEEERLELEGFHKRGVDPDSFAGATGTEYNAGAFSKDGMVQENTYSIPASQEKSAWGAAPMVAGAAGVGAAAGGMRSYSDNPNGHGQLMSPLRTQSPGMPPSGPLPMAPSRSASQGGYSRLGSPDGQQSPPPPMSPPSHGYSDHGFGGQQGYNNGGYGGASQGYFNNGGAQGGFR   2023-09-29 07:13:44     NULL    1445577 NULL    NULL    0
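
Since only a few columns are needed from rows like these, the file could in principle be read as a stream and trimmed on the fly rather than loaded whole. The sketch below assumes the rows are tab-separated (the spacing in the example above suggests so, but that is not confirmed here); `iter_columns` and the column indices are hypothetical and would need to be checked against the loader code.

```python
import gzip


def iter_columns(path, wanted):
    """Yield only the selected columns from a gzipped tab-separated file, row by row."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            yield tuple(fields[i] for i in wanted)
```

For example, `iter_columns("pfamseq.txt.gz", (0, 1))` would yield pairs such as the accession and ID from the first two columns, without ever holding the long sequence strings for the whole file in memory.
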
XueqiongLi commented 2 months ago

After group discussion, our decision is to provide users with a loki.db database that already includes the Pfam data, since Pfam allows data redistribution. The file now exists in the GitHub repository. We need to mention this on the manual website as well.