Handle Epub dates as real dates

sinergatis commented 8 years ago

Book.dc_issued is currently a CharField, but it would probably make sense to store it as a proper date/time field. We would need to look at the standards and do the usual "find out whether epubs really follow them" check:

http://www.idpf.org/epub/30/spec/epub30-publications.html#sec-opf-dcmes-optional

The DCMES date Element

The date element must only be used to define the publication date of the EPUB Publication. The publication date is not the same as the last modified date (the last time the content was changed), which must be included using the [DCTERMS] modified property.

For compliance with EPUB 2 Reading Systems, the date string should conform to Date and Time Formats.

http://www.w3.org/TR/NOTE-datetime
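To get a feel for how many stored values already follow that note, a small checker along these lines might help. This is only a sketch: the names `W3C_DATETIME` and `is_w3c_datetime` are made up for illustration, and the pattern checks the shape of the string only (it would still accept an impossible day such as February 31).

```python
import re

# The six granularities listed in the W3C "Date and Time Formats" note:
# YYYY, YYYY-MM, YYYY-MM-DD, and a full date followed by hh:mm, hh:mm:ss
# or hh:mm:ss.s plus a mandatory time zone designator (Z, +hh:mm or -hh:mm).
W3C_DATETIME = re.compile(
    r"^\d{4}"
    r"(?:-(?:0[1-9]|1[0-2])"
    r"(?:-(?:0[1-9]|[12]\d|3[01])"
    r"(?:T(?:[01]\d|2[0-3]):[0-5]\d"
    r"(?::[0-5]\d(?:\.\d+)?)?"
    r"(?:Z|[+-](?:[01]\d|2[0-3]):[0-5]\d)"
    r")?)?)?$"
)


def is_w3c_datetime(value):
    """Return True if value has one of the shapes allowed by the note."""
    return bool(W3C_DATETIME.match(value.strip()))


def conformance(values):
    """Return (conforming, total), skipping empty values."""
    checked = [v for v in values if v and v.strip()]
    return sum(is_w3c_datetime(v) for v in checked), len(checked)
```

Running `conformance()` over a dump of the dc_issued column would give a rough count of in-spec values before deciding on a field type.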

aristippe commented 8 years ago

Makes sense. I do now wonder if there's any reason to preserve original metadata as is, for certain fields. Not sure yet but it is a question.

Related to the issue of logging, perhaps command-line import can have optional logging output to a file, maybe even specifying fields, so it'd be easier to see how particular fields are handled.

sinergatis commented 8 years ago

I'm tempted to do a "small-ish" models pull request, containing this change and a couple of other minor ones (add uploader to book, revise Language fields and handling).

Could you at some point dump the dc_issued column from your DB (or alternatively collect the values during an addepub import of sorts?) and include it here (or privately, etc.) so we can assess how closely the epubs adhere to the standard?

aristippe commented 8 years ago

Your recent interest has been incredible! :) Thanks for all the contribs in the last week or so. Sorry I haven't been able to reply to everything; I'll do my best soon. Part of it has been lack of time recently, and part of it is that I've been devoting some of my time to another app which I see as a companion to Pathagar.

Attached is the dc_issued data for about 10,000 books. I had thought about the column too when I was renaming some of the fields (title, authors, publishers, summary) to be less ePub-centric, since one can add a book of any format; date_issued or date_published is possibly a good name. The good news is that the values are mostly in spec, either YYYY or YYYY-MM-DD. Cases of YYYY-MM-DDTHH:MM-… can be stripped of the time, while cases of NONE or spaces can be ignored. Some occasional non-spec formats such as 05-May-2009 could be parsed. Others, like MM-DD-YYYY or DD-MM-YYYY with slashes or dashes, are ambiguous, so we can take just the year. For the remaining minority cases, perhaps taking the first 4-digit number would work.
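The rules above could be sketched roughly like this. It is a hypothetical helper, not code from Pathagar: the name `normalize_issued` is made up, it assumes an English/C locale for month abbreviations such as "May", and it has not been run against the attached data.

```python
import re
from datetime import datetime

# YYYY, YYYY-MM or YYYY-MM-DD, optionally followed by a time part to drop.
_ISO_PREFIX = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?(?:T.*)?$")
# A run of exactly four digits, used as the last-resort year guess.
_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def normalize_issued(raw):
    """Return 'YYYY', 'YYYY-MM', 'YYYY-MM-DD', or None if nothing usable."""
    value = (raw or "").strip()
    if not value or value.upper() == "NONE":
        return None  # NONE or spaces: ignore

    match = _ISO_PREFIX.match(value)
    if match:
        year, month, day = match.groups()
        try:
            # Pad missing parts with 1 just to validate the known ones.
            datetime(int(year), int(month or 1), int(day or 1))
        except ValueError:
            return year  # impossible month/day: keep only the year
        return "-".join(part for part in (year, month, day) if part)

    try:
        # Unambiguous non-spec form such as 05-May-2009.
        return datetime.strptime(value, "%d-%b-%Y").strftime("%Y-%m-%d")
    except ValueError:
        pass

    # Ambiguous (MM-DD-YYYY vs DD-MM-YYYY) or unknown: first 4-digit number.
    match = _YEAR.search(value)
    return match.group(1) if match else None
```

For example, `normalize_issued("05/06/2009")` keeps only `"2009"`, since the day/month order cannot be known.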

Languages: I've thought about it a bit, though I haven't tried to sit down and see exactly how it would work. What seems good is a M:M field that would perhaps contain only full-name language strings, somehow mapped to an EpubLanguages model that would group language codes and variations together under Languages, with a view to allowing new ones to be mapped and edited for language name. Anything you can figure out would be great!

epub dc_issued.txt

sinergatis commented 8 years ago

No worries! I'll probably have to slow down a bit myself towards the end of this week or next week, so I fully understand your situation. I'm mainly just cleaning things up and using the issue tracker to dump my notes and thoughts, making it easier to pick the work back up later. There is plenty of stuff we can take care of without requiring urgent communication, so don't feel rushed at all!

That said, thanks for the awesome data!

Attached is the dc_issued data for about 10,000 books (...)

Great! There do seem to be some deviations from the standard. I'll check them in detail and play around with them, but it looks like we can tackle most of them by matching them against some "patterns" of sorts, and at least extract some info in the ambiguous cases. Great news!

Languages: I've thought about it a bit, though I haven't tried to sit down and see exactly how it would work. What seems good is a M:M field that would perhaps contain only full-name language strings, somehow mapped to an EpubLanguages model that would group language codes and variations together under Languages, with a view to allowing new ones to be mapped and edited for language name. Anything you can figure out would be great!

Hmmm, for the moment my idea was to focus on "retiring" the book/langlist.py approach in favor of using proper, standard language identifiers: if they are guaranteed to be unique and standard, I feel that would make them ideal candidates for the main field in the Language model. Once that is in place, using an "Alias" table of sorts (similar to the authors problem), adding extra fields to the model, or mapping things around should be mainly a matter of preference. I'll give it some more thought and comment on the languages issue!
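As a rough illustration of the "standard identifier plus alias table" idea, using plain dictionaries instead of Django models: everything here is hypothetical (`LANGUAGES`, `ALIASES`, `resolve_language`), and it assumes two-letter codes as the canonical key, which is just one possible choice of standard.

```python
# Canonical table: one unique, standard code per language (assumed two-letter).
LANGUAGES = {"en": "English", "es": "Spanish", "fr": "French"}

# Alias table: variations seen in epub metadata, mapped to a canonical code.
ALIASES = {
    "eng": "en",
    "english": "en",
    "spa": "es",
    "fre": "fr",
    "fra": "fr",
}


def resolve_language(raw):
    """Map a raw metadata value to a canonical code, or None if unknown."""
    key = raw.strip().lower().replace("_", "-")
    primary = key.split("-", 1)[0]  # "en-us" -> "en"
    for candidate in (key, primary):
        code = ALIASES.get(candidate, candidate)
        if code in LANGUAGES:
            return code
    return None
```

Unknown values return `None`, which is where a "map this new variation" step could hook in later.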

sinergatis commented 8 years ago

I'm tagging this as "later", as there is a hairy issue to solve: how to gracefully store "fuzzy" dates, such as dates that only specify a year (and not a month and day). There are a number of Python solutions (and probably Django-specific ones as well; rolling our own shouldn't be too hard either), but I'm not sure at the moment what the best choice would be: what exactly do we want to do with the dates, other than storing them? If they are just for display on the book details page, going for the simplest option makes sense, but if there are plans to do something else with them (such as providing a feed or view sorted by publication date), some more thought would be needed.
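One possible "roll our own" shape, purely as a sketch: keep year, month and day as separate integers, with 0 meaning "unknown". The class name `FuzzyDate` and the 0 convention are assumptions made for this example, not an existing library or anything already in Pathagar.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True, order=True)
class FuzzyDate:
    """A date that may lack a day, or both month and day (0 = unknown)."""

    year: int
    month: int = 0
    day: int = 0

    def __post_init__(self):
        if self.day and not self.month:
            raise ValueError("a day requires a month")
        # Validate the known parts by padding the unknown ones with 1.
        date(self.year, self.month or 1, self.day or 1)

    @classmethod
    def parse(cls, text):
        """Build from 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'."""
        return cls(*(int(part) for part in text.split("-")))

    def __str__(self):
        parts = [f"{self.year:04d}"]
        if self.month:
            parts.append(f"{self.month:02d}")
        if self.day:
            parts.append(f"{self.day:02d}")
        return "-".join(parts)
```

Because `order=True` compares (year, month, day) in that order, a year-only value sorts before any more precise date in the same year, which may or may not be what a "sorted by publication date" view wants.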

aristippe commented 8 years ago

Yeah, not sure. Even if we stuck with a TextField, wouldn't that still be sortable? For future features like a possible search by date (before/after, or a range), maybe text would still work. Don't know.
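A quick check of the text idea, as a sketch with made-up sample values: plain string comparison orders dates chronologically only when the values are already in zero-padded YYYY[-MM[-DD]] form, so text would work for sorting and ranges only after normalizing. The helper `issued_between` is hypothetical.

```python
# Values already normalized to YYYY, YYYY-MM or YYYY-MM-DD.
iso_like = ["2009-05-05", "1999", "2009", "2009-04"]

# Raw values as they might appear before any cleanup.
raw = ["2009-05-05", "12/31/1999", "05-May-2009"]


def issued_between(values, start, end):
    """Keep values v with start <= v < end, compared as plain text."""
    return [v for v in values if start <= v < end]


# Normalized text sorts chronologically, year-only values first.
print(sorted(iso_like))  # ['1999', '2009', '2009-04', '2009-05-05']

# Raw text does not: the 1999 entry lands after a 2009 one.
print(sorted(raw))  # ['05-May-2009', '12/31/1999', '2009-05-05']

# A "published in 2009" range query on normalized text.
print(issued_between(iso_like, "2009", "2010"))
```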

aristippe / pathagar

Handle Epub dates as real dates #9