Change how addArticles() loads stuff from files into new article metadata fields

davidskalinder commented 3 years ago

As mentioned in https://github.com/davidskalinder/mpeds-coder/issues/76#issuecomment-737591386_, 4eb4bcf implements loading the pub_date field from an article metadata file in a kludgey way: it decides whether to get the date by how many columns there are, so other new fields can't use the same solution.

So #76 and #78 will likely make this work by simply always inserting NULLs for the new fields, since we'll populate them from Solr anyway. But once all that is stable, we should solve this properly so that future articles can be loaded from files.

I suppose we could do this by requiring a certain column order, but requiring a title row with the proper columns seems like a much better idea to me...

Mentioning @alexhanna since she's the one who knows how articles are even loaded at all these days...

davidskalinder commented 3 years ago

Hey looky here, there's already an issue for what @alexhanna and I discussed this morning!

My inclination is to do what I said upthread and have the importer rely on a header row with the names as they appear in the new database table (since it's fairly easy to change them in the CSV)? Though if that turns out to be hard then I can go back to just expecting the correct column order. @alexhanna, let me know if any of that sounds stupid.

alexhanna commented 3 years ago

Nope, that makes sense to me.

davidskalinder commented 3 years ago

Hey also there's some changes you made to the addArticles() function in our production setup.py that I don't think have been committed anywhere -- looks like mostly some stuff to catch utf8 errors? Should I roll that stuff into the new setup.py file do we think?

alexhanna commented 3 years ago

Oh, whoops. Yes, probably.

davidskalinder commented 3 years ago

Okay will do

davidskalinder commented 3 years ago

Hmm, I'm stumbling a little while trying to decide how to handle missing columns: basically, whether to default them to null or to require that they exist and choke if they don't.

The following are the ones that could all come from the CSV:

  `title` varchar(1024) DEFAULT NULL,
  `db_id` varchar(255) DEFAULT NULL,
  `pub_date` date DEFAULT NULL,
  `publication` varchar(511) DEFAULT NULL,
  `source_description` varchar(511) DEFAULT NULL,
  `text` mediumtext CHARACTER SET utf8mb4,

It seems that out of those, it's only source_description that's really optional?

The one other hitch to this is that in theory we might, in the near future only, want to import a list of db_ids and then use the ETL tool to import the other fields from Solr. But I think we could still do this by creating a CSV with all the required columns and explicitly setting everything except db_id to be null? (Since I don't think we'll be validating the contents of the columns, except for whatever decode("utf8") does -- we'll only be checking that they exist.)

This might not matter for a while, but if anybody has a good argument for requiring the existence of certain columns or not, please pipe up...

davidskalinder commented 3 years ago

Looking pretty darn good at fa937f995. Will be tested for reals with the new article load, hopefully tonight; then PR'd as part of #59.

davidskalinder commented 3 years ago

PR'd in https://github.com/MPEDS/mpeds-coder/pull/49 and accepted. Deleting the branch and closing.

davidskalinder / mpeds-coder

Change how addArticles() loads stuff from files into new article metadata fields #112