Bahler-Lab / yogy

A web-based resource for orthologous proteins of eukaryotic organisms.

link broken: inparanoid #7

Open sinanshi opened 8 years ago

sinanshi commented 8 years ago

Broken link: http://inparanoid.cgb.ki.se/download/current/sqltables/ (referenced in https://github.com/Bahler-Lab/yogy/blob/master/YogiUp/run_db.csh#L40)

dbitton commented 8 years ago

What does the old file look like?

sinanshi commented 8 years ago

They have a list of files called inparanoid_files.txt, which contains entries like this:

sqltable.M.musculus.fa-P.trichocarpa.fa
sqltable.M.mulatta.fa-P.troglodytes.fa
sqltable.A.thaliana.fa-M.musculus.fa
sqltable.A.thaliana.fa-T.nigroviridis.fa
sqltable.D.melanogaster.fa-D.rerio.fa
sqltable.S.purpuratus.fa-T.nigroviridis.fa
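The file names above appear to encode the two organisms of each pairwise comparison. A minimal sketch of pulling them out, assuming every name follows the `sqltable.<organism>.fa-<organism>.fa` pattern shown here (the function name is hypothetical and not part of YOGY):

```python
def organisms_from_filename(name):
    """Return the two organism names encoded in a pairwise table file name.

    Assumes the 'sqltable.<org1>.fa-<org2>.fa' pattern seen in
    inparanoid_files.txt; any other shape raises ValueError.
    """
    prefix, suffix, separator = "sqltable.", ".fa", ".fa-"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"unexpected file name: {name!r}")
    core = name[len(prefix):-len(suffix)]
    first, found, second = core.partition(separator)
    if not found or not first or not second:
        raise ValueError(f"cannot find two organisms in: {name!r}")
    return first, second


print(organisms_from_filename("sqltable.M.musculus.fa-P.trichocarpa.fa"))
```

Raising on unexpected names, rather than guessing, makes it obvious if a newer release changes the naming scheme.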

The files themselves look like this:

1       2740    C.hominis.fa    1.000   Chro.30328      100%
1       2740    Y.lipolytica.fa 1.000   YALI0C10648g    100%
2       1617    C.hominis.fa    1.000   Chro.30301      100%
2       1617    Y.lipolytica.fa 1.000   YALI0C16566g    100%
3       1239    C.hominis.fa    1.000   Chro.80425      100%
3       1239    Y.lipolytica.fa 1.000   YALI0C11407g    100%
4       1116    C.hominis.fa    1.000   Chro.60382      100%
4       1116    Y.lipolytica.fa 1.000   YALI0C22550g    100%
5       1049    C.hominis.fa    1.000   Chro.80341      100%
5       1049    Y.lipolytica.fa 1.000   YALI0A00352g    100%
5       1049    Y.lipolytica.fa 0.588   YALI0A20152g
6       1025    C.hominis.fa    1.000   Chro.10043      100%
6       1025    Y.lipolytica.fa 1.000   YALI0F12155g    100%
7       923     C.hominis.fa    1.000   Chro.80150      100%
7       923     Y.lipolytica.fa 1.000   YALI0A17127g    100%
8       913     C.hominis.fa    1.000   Chro.20010      100%
8       913     Y.lipolytica.fa 1.000   YALI0D08184g    99%
8       913     Y.lipolytica.fa 0.846   YALI0F25289g
8       913     Y.lipolytica.fa 0.817   YALI0E35046g
8       913     Y.lipolytica.fa 0.739   YALI0D22352g
9       891     C.hominis.fa    1.000   Chro.60546      100%
9       891     Y.lipolytica.fa 1.000   YALI0B20724g    100%
10      854     C.hominis.fa    1.000   Chro.20293      100%
10      854     Y.lipolytica.fa 1.000   YALI0B13904g    100%
11      852     C.hominis.fa    1.000   Chro.30427      100%
11      852     Y.lipolytica.fa 1.000   YALI0C07953g    100%
12      846     C.hominis.fa    1.000   Chro.60284      100%
12      846     Y.lipolytica.fa 1.000   YALI0F04169g    100%
13      842     C.hominis.fa    1.000   Chro.30434      100%
13      842     Y.lipolytica.fa 1.000   YALI0A00264g    100%
14      767     C.hominis.fa    1.000   Chro.10119      100%
14      767     Y.lipolytica.fa 1.000   YALI0A17941g    100%
15      755     C.hominis.fa    1.000   Chro.30389      100%
15      755     Y.lipolytica.fa 1.000   YALI0C17347g    99%
16      753     C.hominis.fa    1.000   Chro.80361      100%
16      753     Y.lipolytica.fa 1.000   YALI0F20218g    100%
17      753     C.hominis.fa    1.000   Chro.80294      100%
17      753     Y.lipolytica.fa 1.000   YALI0D12210g    99%
18      735     C.hominis.fa    1.000   Chro.80471      100%
18      735     Y.lipolytica.fa 1.000   YALI0B15642g    99%
19      725     C.hominis.fa    1.000   Chro.30216      100%
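For reference, a minimal sketch of reading one row of such a table. The field names are borrowed from the inparanoid_member table header quoted later in this thread; treating the optional last column as a bootstrap/confidence percentage, and calling it `bootstrap`, is an assumption.

```python
from typing import NamedTuple, Optional


class InpRow(NamedTuple):
    cluster_nr: int
    main_ortholog_score: int
    organism: str
    inparalog_score: float
    protein_id: str
    bootstrap: Optional[str]  # assumed meaning; absent on some rows


def parse_row(line):
    """Split one whitespace-separated table row into typed fields."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}: {line!r}")
    bootstrap = fields[5] if len(fields) == 6 else None
    return InpRow(int(fields[0]), int(fields[1]), fields[2],
                  float(fields[3]), fields[4], bootstrap)


row = parse_row("5       1049    Y.lipolytica.fa 0.588   YALI0A20152g")
print(row.organism, row.inparalog_score, row.bootstrap)
```

Rows without the trailing percentage (the extra inparalogs with a score below 1.000) come back with `bootstrap` set to `None`.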

The files can be found on the old server at /home/sk11/load2/inptables/inparanoid.sbc.su.se/download/7.0_current/

dbitton commented 8 years ago

could you trace it here? http://inparanoid.sbc.su.se/download/current/

sinanshi commented 8 years ago

I found them: http://inparanoid.sbc.su.se/download/current/Orthologs_other_formats/A.aegypti/ You have to download the .tar.gz file first and extract it; the table is inside. I guess we have to make this ourselves.
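A minimal sketch of that extraction step, assuming the archive has already been downloaded and that the wanted members have "sqltable" in their file name. The archive layout is not shown in this thread, so the `marker` default and the function name are assumptions.

```python
import tarfile
from pathlib import Path


def extract_tables(archive_path, dest_dir, marker="sqltable"):
    """Copy every regular file whose name contains `marker` out of a .tar.gz.

    Members are written flat into dest_dir under their base name, so paths
    stored inside the archive cannot escape the destination directory.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            name = Path(member.name).name
            if member.isfile() and marker in name:
                target = dest / name
                target.write_bytes(archive.extractfile(member).read())
                extracted.append(target)
    return extracted
```

Calling `extract_tables("A.aegypti.tar.gz", "tables")` (a hypothetical archive name) would return the list of files written to `tables/`.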

sinanshi commented 8 years ago

I'm still downloading the files, but there are too many. I don't know if this is due to the recent update. There are 4959 files in the old folder, while for the new data, counting only the entries starting with A, we already have 5409 files. That means we can expect around 25 times more files than before. Does that make sense?

dbitton commented 8 years ago

mmmm not sure, maybe not all files are needed.....

sinanshi commented 8 years ago

It has already been one hour since I started downloading, and it is still at B. We can expect a download time of about 10 hours and around 20-30 GB of storage.

sinanshi commented 8 years ago

Or... we can use the old data, which can be found here: http://inparanoid.sbc.su.se/download/old_versions/data_7.0/

This is exactly the same data as the old one.

dbitton commented 8 years ago

no, the whole point is to update...

sinanshi commented 8 years ago

Apparently the latest version has more than twice as many gene sequences as the previous one:

version release species sequences
7.0 2009-06 100 1687023
8.0 2013-12 273 3718323
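A rough cross-check of the expected file count, reading the 100 and 273 in the figures above as species counts. This is a sketch under two assumptions: the old folder holds one file per unordered species pair, and the new per-species directories each hold one file for every other species, so each pair is listed twice.

```python
from math import comb


def pair_counts(n_species):
    """Return (unordered pairs, pairs counted once per species directory)."""
    unordered = comb(n_species, 2)
    per_directory = n_species * (n_species - 1)
    return unordered, per_directory


old_pairs, _ = pair_counts(100)
new_pairs, new_listed = pair_counts(273)

print(old_pairs, new_pairs, new_listed)
print(round(new_pairs / old_pairs, 1), round(new_listed / old_pairs, 1))
```

This gives 4950 pairs for 100 species, close to the 4959 files reported for the old folder, and 37128 pairs (74256 directory entries) for 273 species. Under these assumptions the growth is roughly 7.5 to 15 times, not 25 times.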

dbitton commented 8 years ago

Well, it is what it is, I guess. Updates are bound to differ, and I guess this is more than one version's worth of changes: InParanoid goes 7.0.1 etc., while YOGY has not been updated for the last 6 years...

sinanshi commented 8 years ago

The data is too big for the server, so I am now trying to download it locally. Hopefully the files will be smaller after processing.

dbitton commented 8 years ago

I suggest you come here this afternoon, maybe around 4pm and we could look at it together

sinanshi commented 8 years ago

The difference between versions 7.0 and 8.0:

7.0

1       4101    A.aegypti.fa    1.000   AAEL009959-PA   100%
1       4101    A.thaliana.fa   1.000   At1g80070.1     100%
1       4101    A.thaliana.fa   0.462   At4g38780.1
2       2380    A.aegypti.fa    1.000   AAEL011187-PA   100%
2       2380    A.thaliana.fa   1.000   At1g20960.1     100%

8.0

1       4114    A.aegypti       1.000   Q16UB0  100%
1       4114    A.thaliana      1.000   Q9SSD2  100%
1       4114    A.thaliana      0.490   F4JUG5
2       2380    A.aegypti       1.000   Q16QS5  100%
2       2380    A.thaliana      1.000   Q9SYP1  100%
2       2380    A.thaliana      0.623   O48534

Both 7.0 and 8.0 give errors like this:

DBD::mysql::st execute failed: Duplicate entry '1---AAEL007132-PA' for key 'PRIMARY' at perl/yogy_add_inp_terms.pl line 118, <FILE> line 1.
DBD::mysql::st execute failed: Duplicate entry '2---AAEL008855-PA' for key 'PRIMARY' at perl/yogy_add_inp_terms.pl line 118, <FILE> line 3.
DBD::mysql::st execute failed: Duplicate entry '5---AAEL000307-PA' for key 'PRIMARY' at perl/yogy_add_inp_terms.pl line 118, <FILE> line 10.

dbitton commented 8 years ago

Well, maybe try to keep it as similar as possible to version 7: you could add the .fa suffix to column 3, and column 5 should remain as is. I'm not sure about the duplicate entry errors...
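A minimal sketch of that column 3 change, assuming whitespace-separated rows as in the 8.0 sample above. Writing the result tab-separated is an assumption about what the loader expects, and the function name is hypothetical.

```python
def to_v7_style(line):
    """Append '.fa' to the organism name in column 3; leave other columns alone."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns: {line!r}")
    if not fields[2].endswith(".fa"):
        fields[2] += ".fa"
    return "\t".join(fields)  # tab-separated output is an assumption


print(to_v7_style("1       4114    A.aegypti       1.000   Q16UB0  100%"))
```

Rows that already carry the suffix pass through unchanged, so the function is safe to run on 7.0 files as well.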

sinanshi commented 8 years ago

So do you suggest that we just ignore the error?

dbitton commented 8 years ago

well not sure, but if the previous version gives the same, we might...
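Before deciding to ignore the errors, it may help to count how many rows would collide. The sketch below assumes the primary key is made of the cluster number, the organism pair and the protein ID. That is only an inference: MySQL usually prints a composite key as its values joined by "-", so '1---AAEL007132-PA' could be cluster 1, an organism pair that came out as a bare "-", and the protein ID. The rows in the example are made up for illustration.

```python
from collections import Counter


def duplicate_keys(rows, with_pair=True):
    """Return the keys that occur more than once.

    rows: iterable of (cluster_nr, organism_pair, protein_id) tuples.
    with_pair=False mimics a load where the organism pair was lost.
    """
    counts = Counter(
        (cluster_nr, pair if with_pair else "", protein_id)
        for cluster_nr, pair, protein_id in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical rows: the same protein is cluster 1 in two different pair files.
rows = [
    (1, "A.aegypti-A.thaliana", "AAEL007132-PA"),
    (1, "A.aegypti-C.elegans", "AAEL007132-PA"),
]
print(duplicate_keys(rows, with_pair=True))
print(duplicate_keys(rows, with_pair=False))
```

If the collisions disappear once the organism pair is part of the key, the errors would point at the pair not being parsed from the file name rather than at the data itself.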

sinanshi commented 8 years ago

What is the difference between perl/yogy_add_inp_terms.pl and perl/add_inp_terms-old.pl and which one should I use for updating inparanoid?

dbitton commented 8 years ago

I have no clue, just run the second one then with the new Inparanoid files :-)

sinanshi commented 8 years ago

will do.

sinanshi commented 8 years ago

I tried the old script last week; it takes two days to run. I checked cdc10 this morning, and the inparanoid table doesn't show up. The newly updated inparanoid_member table looks quite different from the old one. I don't know if this is just caused by the difference in the data. Please let me know if you can spot anything wrong here.

new table

cluster_nr main_ortholog_score organism organism_pair inparalog_score uniprot_id
1 1083 A.aegypti A 1 Q16FG2
1 1083 A.aeolicus A 1 O67512
2 608 A.aegypti A 1 Q1HRQ7
2 608 A.aeolicus A 1 O66907
3 591 A.aegypti A 1 Q0IFX5
3 591 A.aeolicus A 1 O67618
4 590 A.aegypti A 1 Q17FL3
4 590 A.aegypti A 1 Q17H12
4 590 A.aeolicus A 1 O67828
5 581 A.aegypti A 1 Q17D48
5 581 A.aegypti A 0.096 Q16HA3
5 581 A.aegypti A 0.096 Q16J19
5 581 A.aeolicus A 1 O67411

old table

cluster_nr main_ortholog_score organism organism_pair inparalog_score uniprot_id
1 5097 ensAG ensAG-ensCE 1 AGAP002015-PA
1 5097 ensCE ensAG-ensCE 1 CE23997
2 4521 ensAG ensAG-ensCE 1 AGAP001633-PA
2 4521 ensCE ensAG-ensCE 1 CE33018
3 4471 ensAG ensAG-ensCE 1 AGAP010750-PA
3 4471 ensCE ensAG-ensCE 1 CE43332
4 4245 ensAG ensAG-ensCE 1 AGAP006885-PA
4 4245 ensCE ensAG-ensCE 1 CE00122
5 3139 ensAG ensAG-ensCE 1 AGAP000331-PA
5 3139 ensCE ensAG-ensCE 1 CE05765
6 3087 ensAG ensAG-ensCE 1 AGAP006686-PA
6 3087 ensCE ensAG-ensCE 1 CE07373
7 2612 ensAG ensAG-ensCE 1 AGAP001519-PA
7 2612 ensCE ensAG-ensCE 1 CE21971

sinanshi commented 8 years ago

I have a feeling that we don't need this much InParanoid data from http://inparanoid.sbc.su.se/download/current. I don't know if we need all of the following species.

A.aegypti/ C.elegans/ D.virilis/ K.lactis/ N.gruberi/ P.sorbitophila/ T.adhaerens/ A.aeolicus/ C.familiaris/ D.willistoni/ K.pastoris/ N.haematococca/ P.tetraurelia/ T.annulata/ A.anophagefferens/ C.floridanus/ E.aedis/ L.africana/ N.leucogenys/ P.trichocarpa/ T.asahii/ A.bisporus/ C.gigas/ E.bieneusi/ L.bicolor/ N.parisii/ P.tricornutum/ T.blattae/ A.capsulata/ C.glabrata/ E.caballus/ L.braziliensis/ N.vectensis/ P.tritici-repentis/ T.brucei/ A.carolinensis/ C.globosum/ E.coli/ L.chalumnae/ N.vitripennis/ P.troglodytes/ T.castaneum/ A.cephalotes/ C.gloeosporioides/ E.cuniculi/ L.elongisporus/ O.anatinus/ P.ultimum/ T.chinensis/ A.darlingi/ C.griseus/ E.cymbalariae/ L.infantum/ O.cuniculus/ P.vivax/ T.cruzi/ A.delicata/ C.hominis/ E.dermatitidis/ L.interrogans/ O.dioica/ P.yoelii/ T.delbrueckii/ A.echinatior/ C.immitis/ E.histolytica/ L.loa/ O.garnettii/ R.baltica/ T.gondii/ A.gambiae/ C.intestinalis/ E.nidulans/ L.maculans/ O.latipes/ R.communis/ T.guttata/ A.gossypii/ C.jacchus/ E.siliculosus/ L.major/ O.niloticus/ R.delemar/ T.heterothallica/ A.gypseum/ C.japonica/ F.catus/ L.thermotolerans/ O.sativa/ R.glutinis/ T.hominis/ A.kawachii/ C.lusitaniae/ F.nucleatum/ M.acetivorans/ O.tauri/ R.norvegicus/ T.maritima/ A.melanoleuca/ C.militaris/ F.pseudograminearum/ M.acridum/ P.abelii/ Salpingoeca.sp./ T.melanosporum/ A.mellifera/ C.neoformans/ F.radiculosa/ M.brevicollis/ P.aerophilum/ S.bicolor/ T.nigroviridis/ A.oligospora/ C.owczarzaki/ G.aculeatus/ M.brunnea/ P.aeruginosa/ S.cerevisiae/ T.parva/ A.pisum/ C.parvum/ G.clavigera/ M.domestica/ P.alecto/ S.coelicolor/ T.pseudonana/ A.queenslandica/ C.porcellus/ G.destructans/ M.gallopavo/ P.berghei/ S.commune/ T.rubripes/ A.thaliana/ C.quinquefasciatus/ G.gallus/ M.globosa/ P.brasiliensis/ S.harrisii/ T.rubrum/ B.bassiana/ C.reinhardtii/ G.gorilla/ M.graminicola/ P.carnosa/ S.invicta/ T.spiralis/ B.bovis/ C.remanei/ G.graminis/ M.guilliermondii/ P.chabaudi/ S.italica/ T.stipitatus/ B.dendrobatidis/ C.savignyi/ 
G.intestinalis/ Micromonas.sp./ P.digitatum/ S.lacrymans/ T.thermophila/ B.distachyon/ C.sinensis/ G.lozoyensis/ M.jannaschii/ P.falciparum/ S.lycopersicum/ T.vaginalis/ B.floridae/ C.trachomatis/ G.max/ M.larici-populina/ P.graminis/ S.macrospora/ T.yellowstonii/ B.fuckeliana/ C.variabilis/ G.sulfurreducens/ M.lucifugus/ P.humanus/ S.mansoni/ U.maydis/ B.hominis/ D.ananassae/ G.violaceus/ M.mulatta/ P.indica/ S.moellendorffii/ U.reesii/ B.japonicum/ D.discoideum/ G.zeae/ M.musculus/ P.infestans/ S.passalidarum/ V.carteri/ B.malayi/ D.grimshawi/ H.arabidopsidis/ M.oryzae/ P.jiroveci/ S.pombe/ V.corneae/ B.mori/ D.hansenii/ H.glaber/ M.osmundae/ P.knowlesi/ S.purpuratus/ V.culicis/ B.rapa/ D.melanogaster/ H.salinarum/ M.perniciosa/ P.kodakaraensis/ S.reilianum/ V.dahliae/ B.subtilis/ D.mojavensis/ H.saltator/ M.phaseolina/ P.marinus/ S.sclerotiorum/ V.polyspora/ B.taurus/ D.plexippus/ H.sapiens/ M.putorius/ P.nodorum/ S.scrofa/ V.vinifera/ B.thetaiotaomicron/ D.pseudoobscura/ H.virens/ M.tuberculosis/ P.pacificus/ S.solfataricus/ W.bancrofti/ C.albicans/ D.pulex/ H.vulgare/ N.caninum/ P.pallidum/ S.stipitis/ W.ciferrii/ C.aurantiacus/ D.purpureum/ I.multifiliis/ N.castellii/ P.patens/ stderr/ W.sebi/ C.brenneri/ D.radiodurans/ I.scapularis/ N.ceranae/ P.placenta/ S.tridecemlineatus/ X.maculatus/ C.briggsae/ D.rerio/ K.africana/ N.crassa/ P.ramorum/ S.tuberosum/ X.tropicalis/ C.cinerea/ D.turgidum/ K.cryptofilum/ N.fumigata/ P.sojae/ Synechocystis.sp./ Y.lipolytica/
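If only some species are needed, the directory listing above could be narrowed down before downloading. A minimal sketch follows; the `wanted` set is purely hypothetical, since this thread does not say which organisms YOGY actually requires.

```python
def select_species(listing, wanted):
    """Split a whitespace-separated directory listing into chosen and missing names."""
    available = {entry.rstrip("/") for entry in listing.split()}
    chosen = sorted(available & set(wanted))
    missing = sorted(set(wanted) - available)
    return chosen, missing


# Shortened sample of the listing; the wanted set is an illustration only.
listing = "A.aegypti/ C.elegans/ H.sapiens/ S.cerevisiae/ S.pombe/ stderr/"
wanted = {"S.pombe", "S.cerevisiae", "H.sapiens"}
print(select_species(listing, wanted))
```

Non-species entries such as `stderr/` drop out automatically because they are never in the wanted set, and any wanted name that is absent from the listing is reported instead of being silently skipped.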