Open sinergatis opened 8 years ago
Makes sense. I do now wonder if there's any reason to preserve original metadata as is, for certain fields. Not sure yet but it is a question.
Related to the issue of logging, perhaps command-line import can have optional logging output to a file, maybe even specifying fields, so it'd be easier to see how particular fields are handled.
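A minimal sketch of what such optional logging could look like, using only `argparse` and `logging` from the standard library. The option names `--log-file` and `--log-fields`, the logger name, and the helper functions are all hypothetical; none of them exist in Pathagar, and wiring this into the actual import command is not shown.

```python
import argparse
import logging


def build_parser():
    # Hypothetical options, for illustration only.
    parser = argparse.ArgumentParser(description="Import logging sketch")
    parser.add_argument("--log-file", help="write the import log to this file")
    parser.add_argument(
        "--log-fields",
        default="",
        help="comma-separated metadata fields to log (default: all)",
    )
    return parser


def configure(args):
    """Return a logger and the list of field names selected on the command line."""
    logger = logging.getLogger("import_sketch")
    logger.setLevel(logging.INFO)
    if args.log_file:
        logger.addHandler(logging.FileHandler(args.log_file))
    fields = [f.strip() for f in args.log_fields.split(",") if f.strip()]
    return logger, fields


def log_fields(logger, metadata, fields):
    """Log the raw value of each selected field; log every field if none were selected."""
    for name in (fields or sorted(metadata)):
        logger.info("%s=%r", name, metadata.get(name))
```

Logging the raw value with `%r` keeps stray whitespace and odd formats visible, which is the point when checking how a particular field is handled.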
I'm tempted to do a "small-ish" models pull request, containing this change and a couple of other minor ones (add uploader to book, revise Language fields and handling).
Could you at some point dump the dc_issued column from your DB (or alternatively compile the values during an addepub import of sorts?) and include it here (or in private, etc.), so we can assess how closely the epubs adhere to the standard?
Your recent interest has been incredible! :) Thanks for all the contribs in the last week or so. Sorry I haven't been able to reply to everything. I'll do my best soon. Part of it has been a lack of time recently, and part of it is that I'm devoting some of my time to another app which I see as a companion to Pathagar.
Attached is the dc_issued data for about 10,000 books. I had also thought about this column when I was renaming some of the fields like title, authors, publishers, and summary to be less ePub-centric, since one can add a book of any format. date_issued or date_published is possibly a good name. The good news is that the formats are mostly in spec, either YYYY or YYYY-MM-DD. Cases of YYYY-MM-DDTHH:MM-… can be stripped of the time, while cases of NONE or spaces can be ignored. Some occasional non-spec formats such as 05-May-2009 could be parsed, while from others like MM-DD-YYYY or DD-MM-YYYY (with slashes or dashes) we can take only the year, since the day and month are ambiguous. For the other minority cases, perhaps taking the first 4-digit number would work.
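The rules described above could be sketched roughly as follows. `parse_issued` is a hypothetical helper, not existing Pathagar code, and it returns a `(year, month, day)` tuple in which month and day are `None` when missing or ambiguous. It assumes month names are in English (the default C locale) and does not validate that month and day values are in range.

```python
import re
from datetime import datetime

# In-spec values: YYYY, YYYY-MM or YYYY-MM-DD, optionally followed by a time part.
_ISO = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?(?:T.*)?$")
# Ambiguous numeric dates such as 03/04/2009 or 03-04-2009: only the year is kept.
_NUMERIC = re.compile(r"^\d{1,2}[-/]\d{1,2}[-/](\d{4})$")
# Last resort: the first standalone 4-digit number.
_YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def parse_issued(raw):
    """Return (year, month, day) with None for unknown parts, or None if unusable."""
    value = (raw or "").strip()
    if not value or value.upper() == "NONE":
        return None

    match = _ISO.match(value)
    if match:
        year, month, day = match.groups()
        return (
            int(year),
            int(month) if month else None,
            int(day) if day else None,
        )

    try:
        # Unambiguous non-spec format such as 05-May-2009.
        parsed = datetime.strptime(value, "%d-%b-%Y")
        return (parsed.year, parsed.month, parsed.day)
    except ValueError:
        pass

    match = _NUMERIC.match(value) or _YEAR.search(value)
    if match:
        return (int(match.group(1)), None, None)
    return None
```

Running something like this over the attached dump would show how many values fall through to each rule, which should help decide whether the fallbacks are worth keeping.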
Languages
I've thought about it a bit, though I haven't tried to sit down and see how exactly it would work. What seems good is an M:M field that would perhaps contain only full-name language strings, somehow mapped to an EpubLanguages model that would group language codes and variations under Languages, with a view to allowing new ones to be mapped and their language names edited. Anything you can figure out would be great!
No worries! I'll probably have to slow down a bit myself towards the end of this week or next week, so I fully understand your situation. I'm mainly just cleaning things up and using the issue tracker to dump my notes and thoughts, which makes it easier to pick the work back up later. There is plenty of stuff we can take care of without requiring urgent communication, so don't feel rushed at all!
This said, thanks for the awesome data!
Attached is the dc_issued data for about 10,000 books (...)
Great! There do seem to be some deviations from the standard. I'll check them in detail and play around with them, but it looks like we can tackle most of them by trying to match them against some "patterns" of sorts, and at least extract some info in the ambiguous cases. Great news!
Languages I've thought about it a bit though haven't tried to sit down and see how exactly it would work. What seems good is a M:M field that would perhaps contain only full name language strings, that's somehow mapped to an EpubLanguages model that would group together language codes and variations to Languages with a view to allow new ones to be mapped and edited for language name. Anything you can figure out would be great!
Hmmm, for the moment my idea was to focus on "retiring" the book/langlist.py approach in favor of using proper, standard language identifiers. If they are guaranteed to be unique and standard, I feel this would make them ideal candidates for the main field in the Language model. Once that is in place, using an "Alias" table of sorts (similar to the authors problem), adding extra fields to the model, or mapping things around should be mainly a matter of preference. I'll give it some more thought and comment on the languages issue!
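To make the "standard identifier plus alias table" idea concrete, here is a plain-Python sketch. The two dictionaries stand in for what would presumably be a Language model and an alias model in Django; the entries are a tiny illustrative sample, and `canonical_language` is a hypothetical helper.

```python
# Illustrative sample only: canonical ISO 639-1 codes mapped to display names.
LANGUAGES = {"en": "English", "es": "Spanish", "fr": "French"}

# Hypothetical alias table: variants found in book metadata -> canonical code.
ALIASES = {
    "eng": "en",
    "english": "en",
    "spa": "es",
    "español": "es",
    "fre": "fr",
    "fra": "fr",
}


def canonical_language(raw):
    """Return the canonical code for a raw language value, or None if unknown."""
    key = (raw or "").strip().lower().replace("_", "-")
    if not key:
        return None
    if key in LANGUAGES:
        return key
    if key in ALIASES:
        return ALIASES[key]
    # Regional variants such as "en-US" fall back to their primary subtag.
    primary = key.split("-", 1)[0]
    if primary in LANGUAGES:
        return primary
    return ALIASES.get(primary)
```

Values that come back as `None` are the ones a maintainer would need to map by hand, which matches the idea of letting new variations be added to the alias table over time.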
I'm tagging this as "later", as there is a hairy issue to solve: how to gracefully store "fuzzy" dates, such as dates that only specify a year (and not a month and day). There are a number of Python solutions (and probably Django-specific ones as well, and rolling our own shouldn't be too hard either), but I'm not sure at the moment what the best choice would be: what exactly do we want to do with the dates, other than storing them? If they are just for displaying on the book details page, going for the simplest option makes sense, but if there are plans to do something else with them (such as providing a feed or view sorted by publication date, etc.), some more thought would be needed.
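One possible "roll our own" shape for fuzzy dates, as a sketch only: keep year, month and day as separate values, with the missing parts left as `None`. `PartialDate` is a hypothetical class, not an existing library or Pathagar type; in a Django model the same idea could presumably be expressed as three integer columns, two of them nullable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def sort_key(self):
        # Missing parts sort before any real month or day of the same year.
        return (self.year, self.month or 0, self.day or 0)

    def __str__(self):
        # Render only the parts that are known: YYYY, YYYY-MM or YYYY-MM-DD.
        parts = [f"{self.year:04d}"]
        if self.month is not None:
            parts.append(f"{self.month:02d}")
            if self.day is not None:
                parts.append(f"{self.day:02d}")
        return "-".join(parts)


dates = [PartialDate(2009, 5, 3), PartialDate(1999), PartialDate(2009)]
ordered = sorted(dates, key=PartialDate.sort_key)
```

This covers both display (via `str`) and sorting by publication date without pretending that a year-only date is January 1st.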
Yeah, not sure. Even if we stuck with a TextField, wouldn't that still be sortable? For future features like a possible search by date (before/after, or within a range), maybe text would still work. I don't know.
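A quick illustration of the text-sorting question, assuming the stored strings have already been normalized to YYYY, YYYY-MM or YYYY-MM-DD (raw values such as 05-May-2009 would break this). The `in_range` helper is hypothetical.

```python
issued = ["2009-05-03", "1999", "2009", "2009-05", "2010-01-15"]

# Plain string sorting is chronological for normalized values,
# with shorter (less precise) dates sorting first within a year.
ordered = sorted(issued)
# ['1999', '2009', '2009-05', '2009-05-03', '2010-01-15']


def in_range(value, start, end):
    """Inclusive range test using plain string comparison."""
    return start <= value <= end


# Works when the bounds are chosen with the string ordering in mind...
inside = in_range("2009-05", "2009", "2009-12-31")      # True
# ...but a year-only value sorts before January 1st of the same year.
year_only = in_range("2009", "2009-01-01", "2009-12-31")  # False
```

So text does sort and can support before/after searches, but range bounds need care when some values carry only a year.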
Book.dc_issued is currently set to a CharField, but it would probably make sense to store it as a proper date/time field. A look at the standards, plus the usual "try to find out if epubs really follow them", would be needed:
http://www.idpf.org/epub/30/spec/epub30-publications.html#sec-opf-dcmes-optional
http://www.w3.org/TR/NOTE-datetime