EOL / tramea

A lightweight server for denormalized EOL data

Create a file that maps all hierarchy entries to EOL taxon concept (page) ids #162

Open KatjaSchulz opened 8 years ago

KatjaSchulz commented 8 years ago

This was requested by @dimus, @jhpoelen, @hyanwong. Ideally, this would be a Darwin Core Archive, but anything that gets the data out would do for now. @JRice suggested we could start with a straight dump of certain columns from the hierarchies table, i.e., identifier used by provider, hierarchy_entry_id, hierarchy_id, taxon_concept_id, what else?

Interested parties, please elaborate if necessary.

JRice commented 8 years ago

It's pretty big, even gzipped, at 120 MB or so.

jhammock commented 8 years ago

When we're sure we have all the requirements, we should schedule updates and post this somewhere :)

JRice commented 8 years ago

I would use a query like this:

select id, identifier, hierarchy_id, taxon_concept_id
from hierarchy_entries
where published = 1 and visibility_id = 1
into outfile 'identifiers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

It took about 25 seconds to run.

...The problem is, of course, that it will end up on the server where you run the command... for me, it was in /eol/data/db_slave3/var/lib/mysql/eol_production/identifiers.csv. From there, it will need to be moved somewhere useful. ...opendata.eol.org would be nice, but seems to be down more often than up, so we'll have to figure out where to put it. :(
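Once the OUTFILE lands on the db host, a minimal way to package it for moving elsewhere might look like the sketch below (the checksum step is my addition, and the sample row is a stand-in so the sketch is runnable anywhere):

```shell
# Stand-in CSV so the sketch runs anywhere; in practice this is the
# OUTFILE produced by the query above, on the db host.
printf '24943193,"IUCN-12392",144,289808\n' > identifiers.csv

# Compress for transfer (JRice notes ~120 MB gzipped) and record a
# checksum so a truncated copy is easy to detect after moving it.
gzip -f -9 identifiers.csv              # produces identifiers.csv.gz
md5sum identifiers.csv.gz > identifiers.csv.gz.md5
```

Verifying the checksum on the destination host (`md5sum -c identifiers.csv.gz.md5`) catches partial copies before anyone tries to use the file.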

KatjaSchulz commented 8 years ago

@JRice Just had a peek at the file. It looks like a lot of nodes don't have an identifier, i.e., column 2 is empty, e.g., for hierarchy 114. How can that be? Isn't the taxon identifier a required field?

JRice commented 8 years ago

An excellent question! Off the top of my head, I don't know. 114 is "EOL Group on Flickr", so perhaps there have been some exceptions to that rule for really weird content partners...

KatjaSchulz commented 8 years ago

Ah, that makes sense. However, it looks like there are a lot of hierarchies that have all or some identifiers missing: 114,115,118,119,120,113,121,124,125,130,132,133,138,140,141,143,144,393,394,396,397,398,127,400,401,402,403,405,410,430,431,433,436,439,440,444,448,449,453,428,454,155,459,463,464,465,467,476,507,508,518,522,537,542,548,553,556,558,563,562,564,565,566,567,570,571,582,587,588,549,589,590,595,601,617,622,625,628,635,636,639,643,648,653,661,690,695,696,697,707,711,716,717,720,721,726,727,731,734,735,714,740,753,777,778,787,798,799,804,808,814,821,826,833,837,838,840,841,842,843,844,852,854,857,861,864,868,869,875,879,880,883,890,893,894,896,901,902,912,913,925,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,131,956,957,962,970,974,977,975,982,984,986,987,988,994,998,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1017,1018,1019,1021,1022,1023,1024,1025,1026,1027,1028,1029,1030,1031,1032,1033,1034,1035,1066,1077,1076,884,1082,1089,1094,1099,1093,1100,1112,1115,604,1145,1154,1157,1158,1161,1164,1165,1166,1167,1208,1255,1256,1090,1261,1264,1269,1270,1272,1273,1277,1279,1282,1168,1285,1288,1289,1291,1169,1296,1297,1298,1301,1302,1303,1304,1305,1308,1309,1310,1311,1312,1313,1314,1315,1316,1317,1320,1322,1323,1325,1328,1332,1335,1344,1346,1354,1362,1367,1371,1377,1378,1381,1383,1388,1393,1394,1395,1400,1401,967,1402,1404,1406,1321,1412,1414,1423,1424,1425,1426,1427,1428,1431,1432,1433,1413,1460,1461,1462,1463,1466,1469,1472,1473,1474,1484,1485,1486,1487,1491,1429,1498,860,1499,1501

Unfortunately, I cannot look up the providers of these hierarchies, because the Hierarchies API seems to be defunct.

JRice commented 8 years ago

That is a lot. :\ Most of them are non-browsable (maybe we should change the query to only include browsable hierarchies?), but some were:

...Does that shed any light?

KatjaSchulz commented 8 years ago

Hm, looks like a bunch of connectors, except Antweb, which is an old-style XML file. Antweb does have taxon identifiers, e.g.:

```
acanthognathus brevicornis
https://www.antweb.org/description.do?genus=acanthognathus&name=brevicornis&rank=species
Arthropoda Insecta Hymenoptera Formicidae Acanthognathus brevicornis
```

Also, paleodb provides identifiers (taxon_no) in the API we use to create the EOL import file, and I would be surprised if Eli didn't use them, but it may be worth checking with him. Example API response:

```
"taxon_no","orig_no","record_type","associated_records","rank","taxon_name","common_name","status","parent_no","senior_no","reference_no","is_extant"
"38613","38613","taxon","","genus","Tyrannosaurus","tyrant lizard","belongs to","38606","","9259","0"
```

JRice commented 8 years ago

Yup, this is worth turning into a bug—these might have all been "recent" harvests (since April), and there might be a bug. I'll do that -> #164 (we should reserve this ticket for talking about the file itself—what goes in it, where to put it, etc).

JRice commented 8 years ago

Hmmn... I could have sworn I'd put this in an earlier comment, but I was thinking that, since my quick experiment with JOINing a query to get the top images here didn't work, it might be worth a separate output file for those... or two, actually. One would have to have "page_id" and "data_object_id" in it, the other would have to have "data_object_id" and a URL (or several, if it's convenient to include thumbnails) to the image.

Not much more work, but a bit more confusing, to put a "view_order" in the second file and have multiple data_object_ids per page, so you could get a few top images. ...Not sure if that helps.

JRice commented 8 years ago

Other fields we could include in this file are rank ids and name ids... though you would need separate files (one each) to figure out what those meant. :S

jhpoelen commented 8 years ago

@KatjaSchulz thanks for opening this. In my mind, this is a duplicate of, or at least related to, #10 . That issue also contains some concrete examples of an approach that is currently operational.

@JRice I am not too concerned about file sizes. I've stored archives on Amazon S3 to get them web-accessible to anyone as part of an automated process. I'd be happy to show you if you are interested.

hyanwong commented 8 years ago

@JRice "if it was that easy, I wonder why we didn't get this a few years ago..." But seriously, thanks. This was also requested by @iimog I think.

Also moving my gitter comment here: "A DB dump with just hierarchy ids is v. useful. 'Best image' ids are not so useful, since 'best image' changes regularly, and can probably be better queried only as needed. Also that bit (getting the best image from an EoL page ID) can be done for multiple images in a single API call now, thanks to Pages API batch mode (https://github.com/EOL/tramea/issues/40)."

hyanwong commented 8 years ago

P.s. @deepreef is also interested in this.

hyanwong commented 8 years ago

@JRice I think, although I may be wrong, that EoL used to have gbif identifiers in here, and lots of them, under provider_id=800. I can't find anything like that in the file:

grep ',800,' identifiers.csv

iimog commented 8 years ago

@JRice Thanks for the file. I downloaded it and for my use case it is perfect. Thanks @hyanwong for drawing my attention to this issue.

dimus commented 8 years ago

Some relevant information from JR from a few months ago:

What I want to do first is to gather all the data directly from the database.

Okay. I'll go through what is where, then.

You probably want to look at hierarchy entries, which have a name_id right on them for their preferred scientific name.

hierarchy_entries have synonyms:

SELECT synonyms.* FROM synonyms WHERE synonyms.hierarchy_entry_id = ?

Common names are more complicated; you have to go through the taxon_concept. Something like:

SELECT * FROM taxon_concept_names INNER JOIN names ON names.id = taxon_concept_names.name_id INNER JOIN languages ON languages.id = taxon_concept_names.language_id WHERE taxon_concept_names.taxon_concept_id = 340732 AND (vern = 1 AND languages.iso_639_1 IS NOT NULL AND languages.iso_639_1 != '')

Here's where you can select which hierarchy entries to grab. You'll need to know the resource_id of each of those (easiest to get from the website, as an admin—search from the CP index page, then find the CP and go to the resource pages to get the id... you'll see here that it's not always clear, as NCBI has two resources... I don't know which you want, or if you want both... it's a mystery.)

Each of those resources then has a hierarchy_id, so you'll need that. Once you have those, you can select the hierarchy_entries with those hierarchy_ids (you should also only select published HEs!), and on each of those is an "identifier" (that's the name of the field) which is the ID used on the resource's website (if applicable). If you want the full URL, you will have to build it off of the "outlink_uri" field on the hierarchy.
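That last step (building the full URL off the outlink_uri field) could be sketched like this, assuming outlink_uri is a template containing a placeholder token for the identifier; the %%ID%% token used here is hypothetical, not confirmed above:

```python
def build_outlink(outlink_uri, identifier, placeholder="%%ID%%"):
    """Substitute a hierarchy entry's provider identifier into the
    hierarchy's outlink_uri template to get the full provider URL.

    The placeholder token is an assumption for illustration; check the
    actual outlink_uri values in the hierarchies table for the real
    convention.
    """
    return outlink_uri.replace(placeholder, str(identifier))
```

For example, `build_outlink("http://example.org/taxa/%%ID%%", "12392")` would give `http://example.org/taxa/12392` (URL hypothetical).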

JRice commented 8 years ago

I'm going to try Joins again in the hopes of identifying the bad indexes... here's a sample, simple query based on the above:

select id, identifier, hierarchy_id, taxon_concept_id
from hierarchy_entries
where published = 1 and visibility_id = 1
limit 10 offset 130000;

JRice commented 8 years ago

Adding a JOIN immediately leaps query times up to 6-plus seconds for a result set of 10. :( Doing a tiny bit of digging into the joins themselves hasn't helped, either—I tried switching to a subselect to avoid the "weak" filters of published and visible, but that only made matters worse...

# HORRIBLE PERFORMANCE (I didn't even let it finish).
# (Note: the subselect has to expose published and visibility_id,
# or the outer WHERE has no columns to filter on.)
SELECT * FROM (
  SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
    he.published, he.visibility_id, n.string
  FROM hierarchy_entries he
  LEFT JOIN names n ON (he.name_id = n.id) ) q
WHERE q.published = 1 AND q.visibility_id = 1
LIMIT 10 OFFSET 130000;

I did get a much better (usable, but not stellar: 1.45 sec with limit 10 offset 130000) result when I remove the LEFT part of the join (and go with straight joins). ...that might be adequate. (Aside: the code is FULL of left joins. ALL OVER. This might be a big factor in the overall speed of the site... though removing them would of course potentially result in missing data on the site, soooo... would need thinking.)

SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id, n.string
FROM hierarchy_entries he
JOIN names n ON (he.name_id = n.id)
WHERE he.published = 1 AND he.visibility_id = 1
LIMIT 10 OFFSET 130000;

I'll try making another join and see if it gets much worse...

SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string, d.object_cache_url, r.label
FROM hierarchy_entries he
JOIN names n ON (he.name_id = n.id)
JOIN top_images i ON (he.id = i.hierarchy_entry_id AND view_order = 1)
JOIN data_objects d ON (i.data_object_id = d.id)
JOIN translated_ranks r ON (he.rank_id = r.rank_id AND r.language_id = 152)
WHERE he.published = 1 AND he.visibility_id = 1
LIMIT 10 OFFSET 130000;

This one is horrible. ...It wasn't so bad when it was top_images only (3.97 sec), but adding dato truly killed it (2 min 8.42 sec), and throwing in the rank didn't hurt much more (it was probably cached, so: 1 min 12.19 sec). Adding into outfile 'identifiers.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'; created a file that took 6 min 20.73 sec, which is usable.

Update: second run (months later) took 3 min 21.32 sec. The file ends up in /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers.csv.

JRice commented 8 years ago

For those of you who would like to see a file with a little more information in it, you can download the new version of the file here. It's a smaller file because there are fewer rows (if there's data missing, the row is missing—this is problematic when it comes to taxa with no image), so you might want both files, if you're looking for completeness... but this file has more information: the scientific name, the rank, and the "object cache URL", from which you can build a link to any (valid) size image, e.g.:

201201171519404 -> http://media.eol.org/content/2012/01/17/15/19404_260_190.jpg

(I'll let you figure out the pattern; it's pretty basic. Hint: the last two numbers are (max) width and height.)
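Spoiling the puzzle slightly: a sketch that builds the image URL from the cache key, with the layout inferred from the single example above (so treat it as an assumption, not a documented API):

```python
def cache_url_to_image(object_cache_url, width=260, height=190):
    """Build a media.eol.org image URL from an object_cache_url value.

    The key appears to encode YYYY MM DD HH followed by an object id,
    e.g. "201201171519404" -> 2012/01/17/15 + 19404; the trailing two
    numbers in the filename are the (max) width and height.
    """
    s = str(object_cache_url)
    y, m, d, h, obj = s[:4], s[4:6], s[6:8], s[8:10], s[10:]
    return (f"http://media.eol.org/content/"
            f"{y}/{m}/{d}/{h}/{obj}_{width}_{height}.jpg")
```

Calling `cache_url_to_image("201201171519404")` reproduces the example URL above; other width/height pairs only work for sizes the site actually generated.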

jhpoelen commented 8 years ago

@JRice was trying to access the identifiers archive at http://beta.eol.org/uploads/data_search_files/big_identifiers.csv.gz and got a 404. Is there a different location I should use?

hyanwong commented 8 years ago

Sorry - this is a bit technical, but I'm a bit confused about IUCN on EoL. For example, take http://eol.org/pages/289808 (African elephant). This has a row in the identifiers file provided by @JRice that looks like:

24943193,"IUCN-12392",144,289808

Which correctly identifies its IUCN id as 12392 (e.g. see http://www.iucnredlist.org/details/12392/0). But this doesn't work for e.g. http://eol.org/pages/892213 (Sporophila murallae), which has

24963679,"IUCN-151170",144,892213 (151170 is no longer a valid IUCN ID: the correct IUCN id for this species is 22735320).

I assumed this was just because the IUCN hierarchy on EoL hasn't been updated in a while, but in fact there is a valid link to IUCN in EoL, as you can tell by going to http://eol.org/pages/892213/hierarchy_entries/24963679/overview, which provides a link to the http://www.iucnredlist.org/details/22735320/0 page in e.g. the distribution text. Is there any way for me to extract the number 22735320 from the EoL page ID 892213 and the hierarchy provider ID 144?

JRice commented 8 years ago

No there is not. You happened to spot an ID associated to a data object, and we don't aggregate those in any direct way to the taxon (and querying every source on every data object on every page to gather possible links isn't an option due to performance)... especially given the link we have from that content partner for that page is, indeed, http://www.iucnredlist.org/details/151170 (I just manually checked it again) ...the fact that it's wrong may point to bad data from the partner, but we don't have code sophisticated enough to identify that and go looking for it elsewhere.

Sorry. :(

It will presumedly be fixed on a re-harvest, if the partner has updated that problem in the data they're providing us.

(Also, I can't believe Google's spellcheck prefers presumably to presumedly. Alas.)

JRice commented 8 years ago

Haaaaaang on. ...It looks like that page has multiple IUCN entries. ...Are you sure the file you're looking at doesn't have other IDs for that taxon page?

Birdlife International links resource Birdlife International links resource 721
Clements Checklist resource Clements Checklist resource 1128
Environments Environments 1317
IUCN Red List IUCN Red List 622
IUCN Red List IUCN Red List (Species Assessed for Global Conservation) 144
BioLib.cz Import Initial BioLib.cz Import 394
Integrated Taxonomic Information System (ITIS) Integrated Taxonomic Information System (ITIS) 903
Species 2000 & ITIS Catalogue of Life: April 2013 Species 2000 & ITIS Catalogue of Life: April 2013 1188
IUCN Structured Data Structured data 1344
TaxonConcept.org outlinks TaxonConcept.org outlinks 828
Wikipedia Wikipedia 431
hyanwong commented 8 years ago

Erm, not sure. But certainly if I query the "correct" iucn number (22735320) against the search_by_provider API, it doesn't match anything. I'm wondering how EoL manages to match the IUCN taxon to the page. I guess it must be via the scientific name?

hyanwong commented 8 years ago

Oh! I see what you mean I should be using hierarchy 622 not 144. Doh. Thanks. In my defence, 622 doesn't crop up in http://eol.org/api/docs/provider_hierarchies, which is decidedly odd.

JRice commented 8 years ago

Also, FWIW, yes: by scientific name. Our pages are all built entirely and exclusively by matching scientific names (with some help from ancestry—also by scientific name), and occasionally by synonyms (only for some resources).

We do not, at present, match IDs between providers.

jhammock commented 8 years ago

blocked by #281

KatjaSchulz commented 8 years ago

We just had another request from @diatomsRcool for a file with all EOL taxon concept ids and associated taxon names. Is this file still around somewhere or can it be recovered?

hyanwong commented 8 years ago

I have a copy. Shall I send it to @diatomsRcool or is a new one soon in the pipeline?

KatjaSchulz commented 8 years ago

That would be great, thanks!

diatomsRcool commented 8 years ago

What are the column headers here? Looks like 4 columns...

JRice commented 8 years ago

What I did on the last run:

mysql_2

# Note this took 17 min 22.41 sec last time.
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string, d.object_cache_url, r.label
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
LEFT JOIN top_images i ON (he.id = i.hierarchy_entry_id AND view_order = 1)
LEFT JOIN data_objects d ON (i.data_object_id = d.id)
LEFT JOIN translated_ranks r ON (he.rank_id = r.rank_id AND r.language_id = 152)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers_with_images.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
SELECT he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id,
  n.string
FROM hierarchy_entries he
LEFT JOIN names n ON (he.name_id = n.id)
WHERE he.published = 1 AND he.visibility_id = 1
INTO OUTFILE 'identifiers.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';

quit

sudo su -

mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers.csv ~jrice/.
chown jrice:jrice !$
mv /eol/data/db_slave2/var/lib/mysql/eol_production/identifiers_with_images.csv ~jrice/.
chown jrice:jrice !$
exit

scp identifiers*.csv eol-app1:~/.
app1

sudo mv ~jrice/identifiers*.csv /eol/data/app/public/.
sudo cp /eol/data/app/public/identifiers*.csv /eol/data/app/log/beta/.

beta

mv log/identifiers*.csv public/.
chmod 744 public/identifiers*.csv
exit

exit

rm identifiers*.csv
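To answer @diatomsRcool's column question above: the columns simply follow the SELECT lists in those queries. A small helper to read the five-column identifiers.csv with names (the column labels themselves are my wording for those fields, not official headers):

```python
import csv

# Labels for the SELECT list in the identifiers.csv query above:
# he.id, he.identifier, he.hierarchy_id, he.taxon_concept_id, n.string
COLUMNS = ["hierarchy_entry_id", "identifier", "hierarchy_id",
           "taxon_concept_id", "scientific_name"]

def read_identifiers(path):
    """Yield one dict per row of identifiers.csv."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield dict(zip(COLUMNS, row))
```

Older four-column dumps (before n.string was added) just yield dicts without the scientific_name key, since zip() stops at the shorter sequence.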
jhammock commented 8 years ago

Most recent version: http://beta.eol.org/identifiers.csv

iimog commented 7 years ago

Hey, I use the identifiers.csv file a lot to cross-map identifiers. However, my version of the file is a little outdated. The link to the most recent version by @jhammock returns a 404 now. Is it possible to get the latest version of the file somewhere?

hyanwong commented 7 years ago

@jhammock mentioned that there was some new stuff in the pipeline, so it might just be on its way to a new home.

jhammock commented 7 years ago

Sorry, not sure what happened to that file. The replacement files are still in prep.

iimog commented 7 years ago

All right. Thank you. I'll be patient.

eliagbayani commented 7 years ago

Hi @hyanwong, @iimog, please check the latest identifiers.csv here: identifiers.csv on opendata.eol.org

If you have an old copy of this file, can you please tell me whether this latest version has more or fewer records once you've downloaded it? Thanks.

hyanwong commented 7 years ago

I can't seem to get that file using wget or cURL, annoyingly. Is there something weird about the URL?

jhpoelen commented 7 years ago

@hyanwong this is consistent with https://github.com/jhpoelen/eol-globi-data/issues/274 . I suspect that the portal is designed for humans to do "click to download" rather than having machines do curl/wget.

jhammock commented 7 years ago

Hmm... I think we were expecting machine readable services also from CKAN. Eli, is there a solution in our CKAN arsenal?

hyanwong commented 7 years ago

@jhpoelen I guess you didn't figure out a way to get either wget or cURL to do it, via some fancy flag setting?

hyanwong commented 7 years ago

Looks like slightly more records (lines)

yans-air:EOL yan$ wc identifiers.csv
 12564087 45169123 854534282 identifiers.csv
yans-air:EOL yan$ wc identifiers_prev.csv 
 12513903 45117019 851901860 identifiers_prev.csv
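Beyond line counts, one quick way to see what actually changed between two dumps is to diff the taxon_concept_id column; a sketch, assuming the four-plus-column layout of identifiers.csv:

```python
import csv

def taxon_concept_ids(path):
    """Collect taxon_concept_id values (column 4 of identifiers.csv)."""
    with open(path, newline="") as f:
        return {row[3] for row in csv.reader(f) if len(row) >= 4}

def added_ids(new_path, old_path):
    """Page ids present in the new dump but not in the old one."""
    return taxon_concept_ids(new_path) - taxon_concept_ids(old_path)
```

Swapping the arguments gives the ids that disappeared, which is often the more interesting question after a harvest.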
jhpoelen commented 7 years ago

@hyanwong I was actually able to download the archive using curl by copy-pasting the request made by the opendata site to start the download (see https://github.com/jhpoelen/eol-globi-data/issues/274#issuecomment-274671229). However, it seemed that this download (and the manual download) timed out after 10 mins, leaving a partial archive. So, with my crappy consumer grade internet connection, I was unable to access the archive. I had to use an external server to download the file, then transfer the file to my development system from that server. Hope this helps!
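For flaky connections like this, curl's resume flag plus a bounded retry loop can sometimes rescue a download that keeps timing out. A sketch (the file:// URL is a local stand-in; swap in the real opendata.eol.org archive link in practice):

```shell
# Local stand-in target so the sketch is runnable; replace $url with the
# real archive link in practice.
printf 'id,identifier\n24943193,"IUCN-12392"\n' > /tmp/eol_remote.csv
url="file:///tmp/eol_remote.csv"

rm -f /tmp/identifiers_dl.csv
n=0
# -C - tells curl to resume from the bytes already on disk, so each retry
# picks up where the last timeout left off instead of starting over.
until curl -sfL -C - -o /tmp/identifiers_dl.csv "$url"; do
  n=$((n + 1))
  [ "$n" -ge 3 ] && break
  sleep 1
done
```

This only helps if the server honours HTTP Range requests; a 503 from the portal (as @iimog reports below for plain curl/wget) is a different failure mode that no amount of resuming fixes.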

eliagbayani commented 7 years ago

Thanks @hyanwong and @jhpoelen for your inputs. Very helpful.

Hi @hyanwong, slightly more records is about right, since we haven't had a real harvest for some time now.

Regarding the problem of serving files for download, I will consult our sysadmin about improving service for clients with lower bandwidth. One non-technical sol'n is to divide big files into smaller chunks for easier download, where either wget, curl, or http should work.

I just increased the Apache Timeout but that didn't seem to work either. I keep getting a partial 100-140 MB file.

I'm able to wget smaller files, e.g. reptiles.

hyanwong commented 7 years ago

Is it a bandwidth problem? I can download fine with a normal browser, just not with wget / curl?

iimog commented 7 years ago

I have the same problem as @hyanwong: click-to-download works (takes ~90s from my university), while both curl and wget return Error 503 Service Unavailable. I don't think it is a timeout problem. But in general: nice to have this file in the opendata.eol.org portal.

eliagbayani commented 7 years ago

Thanks for the input @iimog.