ElixirTeSS / TeSS

Training e-Support Service using Ruby on Rails.

Refactor newtessscraper #233

Closed njall closed 7 years ago

njall commented 8 years ago

Need to do a big refactor of the scraper repository.

- Have a common structure for each scraper.
- Have a template with the common structure ready to copy when creating a new scraper, e.g. cp scraper_template.rb my_new_scraper.rb
- Rename the repository (TeSS_scrapers?).
- Rename the scrapers (no more improved_goblet etc.).
- Rename the gem folder from tessdata to something else.
- Release the RDFa extractor stuff as a separate gem.
- Turn as many routine operations in the scrapers as possible into a library of methods in the gem.

knirirr commented 8 years ago

"TeSS Scrapers" (or something along those lines) sounds fine to me. Which goblet scraper is the one we're currently using? I could remove the rest and rename whichever is kept. Would tess_scraper be an appropriate name for the gem folder? I can deal with refactoring the evil scrapers, but would you, @njall, like to do the RDFa one as you wrote that code? If not I could make that a separate tess_rdfa_scraper gem. A common structure sounds easy enough, and most routine operations are in the gem anyway.

njall commented 8 years ago

> Which goblet scraper is the one we're currently using? I could remove the rest and rename whichever is kept.

goblet_rdfa_scraper is the one we currently use. The goblet_api one may still be of use though. Maybe keep both and name one with api and one with rdfa.

> Would tess_scraper be an appropriate name for the gem folder?

The folder kind of contains the stuff for uploading to TeSS whereas the actual scraping goings-on is in the root folder. Thinking about other people (not us) using this gem to upload their materials to TeSS, maybe tess_api or tess_api_client or something would be a better name? @anenadic @fbacall any thoughts?

> I can deal with refactoring the evil scrapers, but would you, @njall, like to do the RDFa one as you wrote that code? If not I could make that a separate tess_rdfa_scraper gem.

They're very similar formats. I put all the RDFa extraction code into a tess gem so that's just one method. Everything else should be made to look the same as the regular html scrapers - such as page caching, debug mode, index page URL extraction, etc, etc.
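As a rough illustration of the shared structure described here (page caching, debug mode, index page URL extraction), a scraper template might look something like the sketch below. Every name in it (`ScraperTemplate`, `page`, `index_links`, `parse`, the `fetcher` argument) is hypothetical and not taken from the actual scrapers or gem; link extraction uses a plain regex only to keep the sketch dependency-free.

```ruby
require 'digest/sha1'
require 'fileutils'
require 'net/http'
require 'uri'

# Hypothetical base class; none of these names come from the real scrapers.
class ScraperTemplate
  def initialize(index_url:, cache_dir: 'html_cache', debug: false, fetcher: nil)
    @index_url = index_url
    @cache_dir = cache_dir
    @debug = debug
    # The fetcher is injectable so the class can be exercised without a network.
    @fetcher = fetcher || ->(url) { Net::HTTP.get(URI(url)) }
  end

  # Return the page body, reading from the on-disk cache when possible.
  def page(url)
    path = File.join(@cache_dir, "#{Digest::SHA1.hexdigest(url)}.html")
    if File.exist?(path)
      log "cache hit: #{url}"
      return File.read(path)
    end
    log "fetching: #{url}"
    body = @fetcher.call(url)
    FileUtils.mkdir_p(@cache_dir)
    File.write(path, body)
    body
  end

  # Collect absolute links from the index page that match a pattern.
  def index_links(pattern)
    page(@index_url).scan(/href="([^"]+)"/).flatten
                    .select { |href| href.match?(pattern) }
                    .map { |href| URI.join(@index_url, href).to_s }
                    .uniq
  end

  # Subclasses override this to turn one page into a Material or Event.
  def parse(_url, _html)
    raise NotImplementedError, "#{self.class} must implement parse"
  end

  # Scrape every matching page linked from the index.
  def run(pattern)
    index_links(pattern).map { |url| parse(url, page(url)) }
  end

  private

  def log(message)
    warn "[debug] #{message}" if @debug
  end
end
```

With something along these lines, a new scraper would only need to subclass the template and implement `parse`; the RDFa and regular HTML scrapers could then differ in that one method.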

One other thing. At the moment, with the Material.new or Event.new methods you have to specify every field in order. If you don't have a value you have to pass nil, otherwise the order will be wrong and everything will mess up. It would be better if you could pass a hash of values to the initialize method so you could just pass what you need, e.g. Material.new( {title: 'horse', description: 'tall runny thing', keywords: ['tall', 'hoof']} )
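A minimal sketch of what a hash-based initializer could look like is shown here. The `Material` class below is a hypothetical stand-in, not the gem's actual implementation, and its field list is a small assumed subset borrowed from the payload keys that appear later in this thread.

```ruby
# Hypothetical stand-in for the gem's Material class; the field list is an
# assumed subset, not the real gem's definition.
class Material
  FIELDS = %i[title url short_description doi keywords authors].freeze
  LIST_FIELDS = %i[keywords authors].freeze

  attr_accessor(*FIELDS)

  # Accept a single hash so callers pass only the fields they have.
  def initialize(params = {})
    unknown = params.keys - FIELDS
    raise ArgumentError, "unknown fields: #{unknown.join(', ')}" unless unknown.empty?

    FIELDS.each do |field|
      # Missing list fields default to an empty array, everything else to nil.
      default = LIST_FIELDS.include?(field) ? [] : nil
      public_send("#{field}=", params.fetch(field, default))
    end
  end

  def to_h
    FIELDS.to_h { |field| [field, public_send(field)] }
  end
end

material = Material.new(title: 'horse', keywords: ['tall', 'hoof'])
```

Because the order of keys no longer matters, no nil placeholders are needed, and rejecting unknown keys catches misspelled field names early.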

knirirr commented 8 years ago

Thanks, I agree about the Material.new &c. point, that does need dealing with.

fbacall commented 8 years ago

> The folder kind of contains the stuff for uploading to TeSS whereas the actual scraping goings-on is in the root folder. Thinking about other people (not us) using this gem to upload their materials to TeSS, maybe tess_api or tess_api_client or something would be a better name? @anenadic @fbacall any thoughts?

+1 for "tess_api_client". I would also have it in a separate repository rather than just a folder.

knirirr commented 8 years ago

I've renamed the scraper repository and split the gem out to a separate one in preparation for refactoring as discussed above.

knirirr commented 8 years ago

I've changed the initialisation and updated the scrapers to match. When I next have time I'll go through the scrapers and extract anything that could usefully go in the API gem.

knirirr commented 8 years ago

I was looking at this again and I'm not sure there's much we need to do to standardise scraper components now. Does anyone have any suggestions to the contrary?

njall commented 8 years ago

Hey, I can't run any of the scrapers. I'm getting a mix of 500 and 422 errors. I think the hash being sent has a title key with the whole object as its value. Something like: {title: {title: 'blah'}}

njall commented 8 years ago

More specifically:

ERROR: PAYLOAD: {"title":{"title":"Flow Cytometry 2013 Module 3 - Preprocessing and Quality Assurance of FCM Data","url":"http://www.mygoblet.org/training-portal/materials/flow-cytometry-2013-module-3-preprocessing-and-quality-assurance-fcm-data","short_description":"\n\t\tPreprocessing\n\n\t\t\t\tRemoving margin events\n\n\t\t\t\tData transformation: log vs. biexponential\n\n\t\t\t\tData normalization\n\n\n\t\tQuality Assurance\n\n\t\t\t\tOverview of quality assurance concepts: total raw/viable cell count, margin event count, outlier detection based on density of common parameters\n\n\t\t\t\tBuilding quality assurance objects using flowQ and generating summary HTML reports\n\n","doi":null,"remote_updated_date":"2016-06-10 16:09:26 +0100","remote_created_date":"2013-06-26T20:25:33+01:00","content_provider_id":4,"scientific_topic":null,"keywords":["Preprocessing","Quality Assurance","Flow cytometry data"],"licence":null,"difficulty_level":null,"contributors":[],"authors":["Michelle Brazas"],"target_audience":null},"url":null,"short_description":null,"doi":null,"remote_updated_date":null,"remote_created_date":null,"content_provider_id":null,"scientific_topic_names":[],"keywords":[],"licence":null,"difficulty_level":null,"contributors":[],"authors":[],"target_audience":[],"id":null,"long_description":null} Upload failed: 422

njall commented 8 years ago

I'm having this problem on tess.elixir-uk.org and on my machine. I think it's something to do with the gem installation not being updated, possibly?

knirirr commented 8 years ago

Yes, you probably need to update the gem - the new format is working for me on my local installation. Not at a computer now and so can't check.
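One way a stale gem could produce the nested payload reported above is sketched below, assuming the old version took positional arguments with title first. `OldMaterial` is a hypothetical class written only for illustration, not the gem's code.

```ruby
require 'json'

# Hypothetical old-style class: positional arguments in a fixed order.
class OldMaterial
  def initialize(title = nil, url = nil, keywords = [])
    @title = title
    @url = url
    @keywords = keywords
  end

  def to_json(*args)
    { title: @title, url: @url, keywords: @keywords }.to_json(*args)
  end
end

# A scraper already updated to the new calling convention passes one hash.
# The old initializer treats that whole hash as its first argument, the title.
payload = OldMaterial.new({ title: 'blah', url: 'http://example.org' }).to_json
puts payload
# prints {"title":{"title":"blah","url":"http://example.org"},"url":null,"keywords":[]}
```

The whole hash ends up under the title key and every other field keeps its default, which matches the shape of the rejected payload.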

knirirr commented 8 years ago

Is this now working? I've got an email from GitHub suggesting that it is but I can't see it in this thread.