davidskalinder / mpeds-coder

MPEDS Annotation Interface
MIT License

Export file with Solr ID and article text from MAI #110

Open davidskalinder opened 4 years ago

davidskalinder commented 4 years ago

After #59 has killed Solr, we will need to reimplement a non-Solr version of #45.

This might be as simple as just exporting the article-metadata table that will contain the article text, although we might have to do some checks to make sure MS Access can import it without too much grief.

NB this might be kinda low-priority since we can just keep using #45's solution and doing it the old way, although that process is dependent on Solr and so would require it to remain not quite dead. So we should do it eventually, but probably not as urgently as getting rid of the rest of MAI's Solr dependencies.

Mentioning @johnklemke since this will eventually involve him.

davidskalinder commented 3 years ago

This needs another look now that Solr is dead and gone...

davidskalinder commented 3 years ago

Hey, how about that: we already have an SQL script that dumps the whole article-metadata table.

@johnklemke, I think you'll probably have to be the one to assess whether this file will work for getting the articles (with full text and all the important metadata fields) into pass 2. I've put a copy of this file at gdelt/Skalinder/MAI_exports/bpp_article_metadata_2021-03-15_151712.csv -- that is all the articles that are in the Black newspapers database at the moment (so nothing new to pass 2, I think); maybe let me know whether this file will work or if there's something else we need in order to test it?

davidskalinder commented 3 years ago

Okay, back to this one. Here's the report from @olderwoman:

The metadata file below is not properly delimited. No tabs, only comma delimiters, but it is fooled by commas and quotation marks within text fields. But it can be parsed as there are some regularities to it that can be exploited to break the text up properly. I could write Stata code to parse it, unless somebody wants to use a different programming language that is more suited to parsing text. ... The articles need to be imported into the Access data base. If you look at the parsing, you will see that the initial columns parse fine, but the full text of the article gets chopped up every time there is a comma. It is a conceptually simple problem. But if you work with John you should be able to determine the best way to reformat the file for import. It would also help to give things column headers, presumably headers that are compatible with what has been done before.

So yeah I should snoop around our other export queries and see if we have some formatting handy that we know MS Access will like.

davidskalinder commented 3 years ago

All righty, after a tedious amount of trial and error, I think escaping within-field " characters so they're written as "" solves everything: there are now files in gdelt/Skalinder/MAI_exports for each of our three data-containing deployments, all of which I can successfully import into MS Access using the "Get External Data - Text File" wizard with the Date Order set to "YMD" and "First Row Contains Field Names" checked (since I also put column names into the file).
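For anyone who wants to reproduce the escaping outside of SQL, here is a minimal Python sketch of the same idea: every field enclosed in "s, within-field " characters doubled to "", and a header row of column names. The column names and the sample row are made up for illustration; they are not the real article-metadata schema, and this is not the script that produced the files above.

```python
import csv
import io


def write_quoted_csv(rows, fieldnames, out):
    """Write rows with every field quoted and embedded quotes doubled."""
    writer = csv.DictWriter(
        out,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_ALL,   # enclose every field in double quotes
        doublequote=True,        # write an embedded " as ""
        lineterminator="\r\n",
    )
    writer.writeheader()         # first row contains the field names
    writer.writerows(rows)


# Hypothetical columns and data, chosen to include commas and quotes.
fieldnames = ["id", "publication_date", "title", "text"]
rows = [
    {
        "id": 1,
        "publication_date": "1968-04-05",
        "title": 'The "Long Hot Summer", revisited',
        "text": 'He said, "We will march."<br/>Then they did.',
    }
]

buf = io.StringIO()
write_quoted_csv(rows, fieldnames, buf)
exported = buf.getvalue()
print(exported)
```

The dates are written as year-month-day strings here only to match the "YMD" Date Order setting mentioned above; whether Access accepts any other layout is not something this sketch checks.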

I don't think any of our articles currently contain any newline characters (MPEDS replaces them with <br/>), and so I'm not 100% sure whether these exports will still work if a newline ever creeps in. It seems like it should work, since the fields are all enclosed by "s and internal newlines should therefore be ignored; but Access's importer is picky, idiosyncratic, and poorly documented, so I don't know for sure. The article "meta"data changes rarely enough that I think it's better to wait until it breaks and fix it at that point than try to anticipate all the possible ways it could go wrong.
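If we ever want an early warning rather than waiting for the import to break, something like the following Python sketch could scan an export for fields that contain raw newlines. It only shows that a standard CSV parser treats a newline inside a quoted field as part of the field; it says nothing about what Access's importer would do. The function name and the sample data are hypothetical.

```python
import csv
import io


def find_embedded_newlines(csv_text):
    """Return (record_number, column_name) for each field with a raw newline."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader)
    hits = []
    for record_number, record in enumerate(reader, start=1):
        for column_name, value in zip(header, record):
            if "\n" in value or "\r" in value:
                hits.append((record_number, column_name))
    return hits


# Made-up export: the second record's text field contains a raw newline.
sample = (
    '"id","text"\r\n'
    '"1","no newline here, just a comma"\r\n'
    '"2","first line\nsecond line with ""quotes"""\r\n'
)

records = list(csv.reader(io.StringIO(sample, newline="")))
newline_fields = find_embedded_newlines(sample)
print(len(records), newline_fields)
```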

So @olderwoman / @johnklemke, if y'all can confirm that the new files can get to where they need to be (presumably the pass 2 DB) and look good there, then I think we can close this issue? But of course let me know if any new problems arise.