Closed davidskalinder closed 3 years ago
Hey looky here, there's already an issue for what @alexhanna and I discussed this morning!
My inclination is to do what I said upthread and have the importer rely on a header row with the names as they appear in the new database table (since it's fairly easy to change them in the CSV)? Though if that turns out to be hard then I can go back to just expecting the correct column order. @alexhanna, let me know if any of that sounds stupid.
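For illustration, here is a minimal sketch of what header-based loading could look like. It is written in Python 3 with the standard `csv` module and is not the actual `setup.py` code; `KNOWN_COLUMNS` and `read_articles` are hypothetical names, and the column list is simply assumed from the table definition discussed in this thread.

```python
import csv
import io

# Hypothetical list of columns the importer knows how to load; the names are
# assumed to match the new database table.
KNOWN_COLUMNS = ["title", "db_id", "pub_date", "publication",
                 "source_description", "text"]


def read_articles(handle):
    """Yield one dict per CSV row, keyed by the names in the header row."""
    reader = csv.DictReader(handle)
    for row in reader:
        # Pick values by column name, so the column order in the file
        # does not matter and unknown extra columns are ignored.
        yield {col: row[col] for col in KNOWN_COLUMNS if col in row}


# Columns deliberately in a different order than the table definition.
sample = io.StringIO("db_id,title,pub_date\nabc123,Some headline,2020-12-01\n")
articles = list(read_articles(sample))
```

Because values are looked up by name, renaming a header in the CSV is all it takes to line a file up with the table.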
Nope, that makes sense to me.
Hey also there's some changes you made to the `addArticles()` function in our production `setup.py` that I don't think have been committed anywhere -- looks like mostly some stuff to catch utf8 errors? Should I roll that stuff into the new `setup.py` file do we think?
Oh, whoops. Yes, probably.
Okay will do
Hmm, I'm stumbling a little while trying to decide how to handle missing columns: basically, whether to default them to null or to require that they exist and choke if they don't.
The following are the ones that could all come from the CSV:
`title` varchar(1024) DEFAULT NULL,
`db_id` varchar(255) DEFAULT NULL,
`pub_date` date DEFAULT NULL,
`publication` varchar(511) DEFAULT NULL,
`source_description` varchar(511) DEFAULT NULL,
`text` mediumtext CHARACTER SET utf8mb4,
It seems that out of those, it's only `source_description` that's really optional?
The one other hitch to this is that in theory we might, in the near future only, want to import a list of `db_id`s and then use the ETL tool to import the other fields from Solr. But I think we could still do this by creating a CSV with all the required columns and explicitly setting everything except `db_id` to be null? (Since I don't think we'll be validating the contents of the columns, except for whatever `decode("utf8")` does -- we'll only be checking that they exist.)
This might not matter for a while, but if anybody has a good argument for requiring the existence of certain columns or not, please pipe up...
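To make the trade-off concrete, here is a rough sketch of the "require the columns, but allow empty values" option. It is Python 3 and purely illustrative: `REQUIRED`, `OPTIONAL`, and `load_rows` are made-up names, and the required/optional split is just the guess from the comment above (only `source_description` optional), not a settled decision.

```python
import csv
import io

# Assumed split, following the guess above: every CSV-sourced column is
# required to exist in the header except source_description.
REQUIRED = {"title", "db_id", "pub_date", "publication", "text"}
OPTIONAL = {"source_description"}


def load_rows(handle):
    """Check that required columns exist, then return rows with empty values as None."""
    reader = csv.DictReader(handle)
    header = set(reader.fieldnames or [])
    missing = REQUIRED - header
    if missing:
        # Choke early if a required column is absent from the header row.
        raise ValueError("CSV is missing required columns: "
                         + ", ".join(sorted(missing)))
    rows = []
    for row in reader:
        # Only existence of the column is checked; empty or absent values
        # become None so they can be inserted as NULL.
        rows.append({col: (row.get(col) or None) for col in REQUIRED | OPTIONAL})
    return rows


# The "db_id only" case: all required columns present, everything else empty.
ids_only = io.StringIO("title,db_id,pub_date,publication,text\n,abc123,,,\n")
loaded = load_rows(ids_only)
```

With this shape, a file of bare `db_id`s still loads as long as the other required headers are present, while a file that lacks a required header fails before anything is inserted.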
Looking pretty darn good at fa937f995. Will be tested for reals with the new article load, hopefully tonight; then PR'd as part of #59.
PR'd in https://github.com/MPEDS/mpeds-coder/pull/49 and accepted. Deleting the branch and closing.
As mentioned in https://github.com/davidskalinder/mpeds-coder/issues/76#issuecomment-737591386, 4eb4bcf implements loading the `pub_date` field from an article metadata file in a kludgey way: it decides whether to get the date by how many columns there are, so other new fields can't use the same solution.
So #76 and #78 will likely make this work by simply always inserting `NULL`s for the new fields, since we'll populate them from Solr anyway. But once all that is stable, we should solve this properly so that future articles can be loaded from files.
I suppose we could do this by requiring a certain column order, but requiring a title row with the proper columns seems like a much better idea to me...
Mentioning @alexhanna since she's the one who knows how articles are even loaded at all these days...