Closed van-truong closed 2 months ago
These are the original FTP URLs we pull from in the legacy LOKI code. The links look stable and are exactly the same as the ones on the InterPro site.
All files available on the Pfam FTP site, with estimated file sizes: https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/
Pfam A txt file = 9.6 MB
Pfam A significant regions txt file = 5.1 GB
Pfam sequence txt file = 24 GB
ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA.txt.gz
ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA_reg_full_significant.txt.gz
ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamseq.txt.gz
There are much smaller SQL files with the same base filenames. Is it possible to substitute those for the txt files?
Pfam A sql file = 1.1 KB
Pfam A significant regions sql file = 1 KB
Pfam sequence sql file = 1 KB
https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA.sql.gz
https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamA_reg_full_significant.sql.gz
https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamseq.sql.gz
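A quick way to answer the substitution question is to check whether a dump holds any rows at all. The sketch below is only an illustration: the helper name dump_has_data is made up here, and it assumes a dump that carries data would contain INSERT INTO statements (the usual mysqldump layout), which is an assumption about these files rather than something confirmed from the Pfam docs.

```python
import gzip


def dump_has_data(path):
    """Return True if a gzipped SQL dump contains at least one INSERT statement.

    Hypothetical helper: assumes row data, if present, appears as
    'INSERT INTO ...' lines, as in a typical mysqldump file.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.lstrip().upper().startswith("INSERT INTO"):
                return True
    return False
```

A file of about 1 KB that returns False here would only describe the table layout, so it could not stand in for the matching txt file.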
TBD: waiting on Marylyn's input to see if we will pull a different data file that's not 24 GB
Download takes ~11 hrs
But Li said in Brandon's timed log notes that Pfam takes less than 30 min to process and compact into the SQLite schema
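As a rough sanity check on the ~11 hr figure, the average throughput can be worked out from the file size. This is a back-of-the-envelope sketch that assumes the 11 hrs is dominated by the 24 GB pfamseq file and uses decimal units (1 GB = 10**9 bytes); the function names and the 100 Mbps comparison rate are made up for illustration.

```python
def effective_mbps(size_gb, hours):
    """Average throughput in megabits per second (1 GB = 10**9 bytes)."""
    bits = size_gb * 1e9 * 8
    seconds = hours * 3600
    return bits / seconds / 1e6


def hours_at(size_gb, mbps):
    """Hours needed to move size_gb at a sustained rate of mbps."""
    return size_gb * 1e9 * 8 / (mbps * 1e6) / 3600


print(round(effective_mbps(24, 11), 2))  # 4.85
print(round(hours_at(24, 100), 2))       # 0.53
```

Under those assumptions, 24 GB in 11 hrs works out to roughly 4.85 Mbps, so the transfer, not the sub-30-min processing step, accounts for almost all of the wall-clock time.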
Li said:
Those 3 SQL dump files seem to contain only the table structure rather than the actual data. I browsed the FTP site looking for another input option but wasn't successful. FYI, according to the loader code, the pfamseq file uses these columns: pfam accession number, pfam id, name, group, and description. Below is a 2-line example from pfamseq.txt:
A0A010PZJ8 A0A010PZJ8_9PEZI 1 4FD1CDFB3D9B202C 1bd30c447abb827881660b202f470352 COP9 signalosome complex subunit 2 {ECO:0000313|EMBL:EXF72907.1} 4 493 Colletotrichum fioriniae PJ7 Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Sordariomycetes; Hypocreomycetidae; Glomerellales; Glomerellaceae; Colletotrichum; Colletotrichum acutatum species complex. 0 MSDDEDFMQESDEEQYDFEYEEDDDEDSGDVGIENKYYNAKQLKLTDPEDAIAEFLGIPPLEEEKGEWGFKGLKQAIKLEFKLGQYAKATEHYAELLTYVKSAVTRNYSEKSINNMLDYIEKGSDSPKAVACVEKFYSLTLESFQSTNNERLWLKTNIKLAKLLLDRKDYNTVIKKLRDLHKACQKEDGSDDPSKGTYSMEIYALEIQMHAETKNNKQLKRLYQRALKVRSAVPHPKIMGIIRECGGKMHMSEENWAEAQTDFFESFRNYDEAGSLQRIQVLKYLLLTTMLVKSTINPFDSQETKPYKQDPRITAMTDLVDAYQRDDVHAYENVLQKNQDILADPFIAENIDEVTRNMRTKGVLKLIAPYTRMKLSWIAKQLKISEPEVQDILGFLIVDGKIQGKIDQQAGTLEIQSDADSDRTKALYELTQSVSTLYTTMFKEGEGFRSTEFPTDEQTMEMMGGGMTPRGGGRGQPRGVGRKGKGVVPSMWT 2023-09-29 07:13:44 NULL 1445577 NULL NULL 0
A0A010PZK3 A0A010PZK3_9PEZI 1 F95471BD7D21C9C6 df96804c438684f33bedcc923bb6f636 Glycosyl hydrolase family 16 {ECO:0000313|EMBL:EXF72912.1} 4 512 Colletotrichum fioriniae PJ7 Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Sordariomycetes; Hypocreomycetidae; Glomerellales; Glomerellaceae; Colletotrichum; Colletotrichum acutatum species complex. 0 MLSQYTLSALAVLASLAQPALAQVSTKCNPMNTTCPADPAFGMDYNFNFNSTPSTDAWETTVGPVTYTSDNGAEFTISKQGDSPTIRSKFYFFWGRTEIHMRAAKGKGIVSSMMWLSDTLDEVDWEFLGIKNDALSNFFGKGVQDWHNGAEHPVTGSIQDDFHNYTCVWTKEKLEWWVDGNNVRTLLPKDANNSLAYPQTPMRLSLGIWAGGDPRMAAGTREWAGGDTDYAAGPYTMYVKSAQVTDYSSGKEYSFGDKTGSWESIKIAAGNSTVKEALLEEPSKSVSEKFNELSPTAKTAVYAGGVGVGCALIAFGLWYFIRQRRRGANEASLAAKRAEEERLELEGFHKRGVDPDSFAGATGTEYNAGAFSKDGMVQENTYSIPASQEKSAWGAAPMVAGAAGVGAAAGGMRSYSDNPNGHGQLMSPLRTQSPGMPPSGPLPMAPSRSASQGGYSRLGSPDGQQSPPPPMSPPSHGYSDHGFGGQQGYNNGGYGGASQGYFNNGGAQGGFR 2023-09-29 07:13:44 NULL 1445577 NULL NULL 0
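Since most of each row is sequence text the loader may not need, one option is to stream the gzipped file and keep only a few fields per row instead of loading it whole. The sketch below is hypothetical, not the LOKI loader: iter_fields is a made-up name, it assumes the file is tab-delimited (the tabs do not survive in the pasted rows above), and which positions to keep is up to the caller.

```python
import gzip


def iter_fields(path, wanted, delimiter="\t"):
    """Yield tuples of the requested column positions, one row at a time.

    Hypothetical helper: assumes a gzipped, tab-delimited text file.
    Rows with too few fields are skipped rather than raising.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(delimiter)
            if len(fields) <= max(wanted):
                continue  # short or malformed row
            yield tuple(fields[i] for i in wanted)
```

Judging from the two sample rows, positions 0 and 1 would give values like A0A010PZJ8 and A0A010PZJ8_9PEZI, so iter_fields("pfamseq.txt.gz", [0, 1]) would walk the file without ever holding a full sequence column in memory.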
After group discussion, our decision is to provide users with a loki.db database that already includes the Pfam data, since Pfam allows data redistribution. The file now exists in the GitHub repository. We need to mention this on the manual website as well.
The network speed seems to be throttled at <10 Mbps. This data source is huge. Pfam migrated to InterPro, which means
Check if we have the right data URLs.
If we're pulling multiple files, can we split it up to pull them at the same time?
The URLs we have say HTTPS and FTP:
http://pfam.xfam.org/
https://www.ebi.ac.uk/interpro/download/pfam/
https://pfam-docs.readthedocs.io/en/latest/ftp-site.html
https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/
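On pulling the files at the same time, a minimal sketch using a thread pool is below. It is an illustration, not the LOKI downloader: download_all, BASE, FILES, and the fetch parameter are names made up here, and the HTTPS base URL is taken from the links in this thread. Whether it helps depends on how the throttling works; if the limit is per client rather than per connection, parallel transfers would just share the same <10 Mbps.

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed base URL and file names, copied from the links in this thread.
BASE = "https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/"
FILES = ["pfamA.txt.gz", "pfamA_reg_full_significant.txt.gz", "pfamseq.txt.gz"]


def download_all(urls, dest_dir, fetch=urllib.request.urlretrieve, workers=3):
    """Fetch every URL into dest_dir concurrently; return local paths in order.

    fetch(url, target) is injectable so the function can be exercised
    without touching the network.
    """
    os.makedirs(dest_dir, exist_ok=True)
    targets = [os.path.join(dest_dir, url.rsplit("/", 1)[-1]) for url in urls]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, url, target)
                   for url, target in zip(urls, targets)]
        for future in futures:
            future.result()  # re-raise any download error
    return targets
```

Calling download_all([BASE + name for name in FILES], "pfam_downloads") would start all three transfers together. Even in the best case the total time is bounded by the 24 GB file, so the gain over sequential downloads is limited to the time the two smaller files would have taken.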