better way of tagging parsed / unparsed files than directories xml_added and xml_queue?

JonathanReeve / sanger

Margaret Sanger Papers Project Search Engine


better way of tagging parsed / unparsed files than directories xml_added and xml_queue? #63

Open JonathanReeve opened 9 years ago

JonathanReeve commented 9 years ago

The current parsing system is a little cumbersome, especially from the point of view of version control. Moving files back and forth between these directories may create lots of commit noise.

CathyHajo commented 9 years ago

We don't generally do this anymore. Since the parse program broke, we just put the files in xml_added.

JonathanReeve commented 9 years ago

Since the parsing engine is working now (through parse2.php, though, strangely, not parse.php), it's probably a good idea to move edited files to xml_queue so that they can be parsed.

Ideally, there could be a better organizational model that might look something like this:

- Break out all XML files to a separate repository that exists as a submodule of the main repo. All XML files can live in a single directory there.
- The GitHub pull script from #65 can automatically copy changed files to xml_queue, or trigger some other sort of parsing (see #72).

Or one or more of these:

- On the parse page, show all XML files as options for parsing, with the most recently changed ones at the top.
- Inspect the payload from GitHub to see which files have been modified (if that comes through in the JSON), and automatically parse those files.
- Scrap parsing altogether and transform the XML files dynamically with XSLT.
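
As a rough illustration of the payload idea: GitHub's push webhook payload carries a `commits` list, and each commit lists the paths it `added`, `modified`, and `removed`. The sketch below collects the XML files that a push touched. It is written in Python only for illustration (the project's parser is PHP), and the function name `changed_xml_files` and the sample file names are hypothetical.

```python
import json
from typing import List


def changed_xml_files(payload: dict) -> List[str]:
    """Return XML paths added or modified by a push, in first-seen order.

    Assumes `payload` is the decoded JSON body of a GitHub push event.
    """
    changed = {}  # a dict keeps insertion order, so it works as an ordered set
    for commit in payload.get("commits", []):
        for path in commit.get("added", []) + commit.get("modified", []):
            if path.lower().endswith(".xml"):
                changed[path] = None
        # A file removed later in the same push no longer needs parsing.
        for path in commit.get("removed", []):
            changed.pop(path, None)
    return list(changed)


if __name__ == "__main__":
    # Hypothetical payload, trimmed to the fields used above.
    sample = json.loads("""
    {
      "commits": [
        {"added": ["xml_added/doc1.xml"], "modified": ["README.md"], "removed": []},
        {"added": [], "modified": ["xml_added/doc2.xml"], "removed": ["xml_added/doc1.xml"]}
      ]
    }
    """)
    print(changed_xml_files(sample))
```

The returned list could then be handed to whatever does the parsing, which would make the copy to xml_queue unnecessary.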

CathyHajo commented 9 years ago

Should I be deleting all the files out of xml_queue once I have committed the changes and synced? What I have been doing is editing in xml_added, then copying the file to xml_queue and committing/syncing, then parsing. It seems to me that this doesn't make a lot of sense. I like your idea of having one directory where all the files live, like our XML drafts folder on Dropbox. Will the Git pull ignore backup files? They change every time the main file changes, and I have to uncheck the boxes so that they are not copied to the site.

JonathanReeve commented 9 years ago

I think what you're doing sounds fine. You might just have to log into the server directly and pull in changes manually each time you parse, until I can find a better solution. You don't need to delete anything.

Yep, having two directories as a way of keeping track of which files have been parsed is not ideal. I'll look into having a database table for keeping track of which files have been parsed, or some other way of keeping track.
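
One way the database-table idea might look, as a minimal sketch rather than a worked-out design: store each file's path and a hash of its contents when it is parsed, and treat a file as needing a parse whenever it is missing from the table or its hash has changed. The example uses Python and SQLite purely for illustration; the table name `parsed_files` and all function names are assumptions, not part of the existing code.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_tracker(db_path=":memory:"):
    """Open the tracking database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parsed_files ("
        " path TEXT PRIMARY KEY,"
        " sha256 TEXT NOT NULL,"
        " parsed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def file_digest(path):
    """Hash the file contents so that edits are detected regardless of mtime."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_parsing(conn, path):
    """True if the file was never parsed or has changed since its last parse."""
    row = conn.execute(
        "SELECT sha256 FROM parsed_files WHERE path = ?", (str(path),)
    ).fetchone()
    return row is None or row[0] != file_digest(path)


def mark_parsed(conn, path):
    """Record the file's current hash after a successful parse."""
    conn.execute(
        "INSERT OR REPLACE INTO parsed_files (path, sha256) VALUES (?, ?)",
        (str(path), file_digest(path)),
    )
    conn.commit()
```

With something like this, all XML files could stay in one directory, and nothing would have to move between xml_added and xml_queue to record parse state.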

JonathanReeve commented 9 years ago

Re: .bak files (and .bak.bak files), I guessed that those weren't files that needed to be on the production server, so I just added them to .gitignore so that they won't wind up being committed and parsed.
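
For reference, a single glob such as `*.bak` is enough to cover both cases, since a name ending in `.bak.bak` also ends in `.bak`. The short Python check below illustrates this with `fnmatchcase`, which behaves like .gitignore matching for a simple pattern of this kind applied to the file name. The exact pattern in the repository's .gitignore is an assumption here, and the file names are made up.

```python
import posixpath
from fnmatch import fnmatchcase

# Assumed pattern; the actual .gitignore entry may differ.
IGNORE_PATTERNS = ["*.bak"]


def is_ignored(path):
    """Check a repo-relative path's file name against the ignore patterns."""
    name = posixpath.basename(path)
    return any(fnmatchcase(name, pattern) for pattern in IGNORE_PATTERNS)


if __name__ == "__main__":
    # Hypothetical file names.
    for path in ["xml_added/doc1.xml", "xml_added/doc1.xml.bak",
                 "xml_added/doc1.xml.bak.bak"]:
        print(path, is_ignored(path))
```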