Allow import in place from command-line to use existing files

aristippe commented 8 years ago

Currently implemented in replacement of book_file as FileField though it would be nice to allow command-line import in place while remaining compatible with older versions. Perhaps a second field in Book model for file path.

sinergatis commented 8 years ago

An update, consequence of #10. At this moment:

Book.book_file is a FileField. It "contains" the actual .epub file at static_media/books/xxxxxx.epub.
Book.original_path is a CharField. It contains:
- the full path of the "original" file if imported via addepup (/home/user/some/example/folder/xxxx.epub).
- the filename of the "original" file if imported via web interface (as there is no real path there, it comes from the web) (xxxxx.epup).

original_path is not really used at the moment for downloading, etc: it is only displayed on the book description view.

This said, looking into mechanisms for avoiding copying the existing files makes sense in some scenarios, and something that it's worth looking into. A possible solution might involve symbolic links.

aristippe commented 8 years ago

I'm not sure how others are commonly deploying pathagar, though it seems good to make it an optional parameter in addepub, and possibly addbooks.py, which I haven't used but I believe imports through a CSV or JSON. That'll keep the behavior default and provide an option for those that want it.

sinergatis commented 8 years ago

Yeah, I also feel it might be a decision best left to the user (probably via a setting, or parameters to the importers). Which incidentally reminded me to create #14!

sinergatis commented 8 years ago

Trying to refine a bit the requirements for this issue, could you please let me know if in your scenario, when an epub /home/my/epubsfolder/greatbook.epub is imported via the command line:

having the original_path (/home/my/epubsfolder/greatbook.epub) displayed on the UI is required and desirable?
placing a symbolic link to the file on static-media/books (/path/to/pathagar/static-media/book/greatbook.epub) is acceptable?
retaining the same filename as the original file when the user downloads (ie. user clicks in downloads, gets prompted to download greatbook.epub is required and desirable?

It mostly is for trying an approach based on symbolic links, as it will probably require minimal changes and reuse most of the Django storage and media stuff. EDIT Note: this would technically not be an in place import - it will basically take advantage of the OS facilities for transparently handling the issue.

The third one is not strictly relevant to this issue, but basically for gathering information for a future discussion on standardizing the epub file names internally to something generic and consistent (which in turn might solve any weird-file-name or related encoding issues, which are probably still present).

aristippe commented 8 years ago

Thank you for taking some time to take a look!

• original_path in the UI isn't really required but perhaps useful. I added it in the book_detail template, shown for admins, mainly for debugging. It could be useful for management for some reason, for instance if file locations change later, or in the future one wants to batch search and edit based on path. If there's any foreseeable security consequences of leaving, it's not required. Would be nice for some of the above reasons, but not essential. If it complicates design, don't worry about it. It's stored so far as I can tell, it can simply added to the template later. • Symbolic link is perhaps preferable to not have, and hopefully that doesn't make it much more complicated. I was thinking if FileField was an optional, and then testing if it's null, use original_path, does that make it simpler? • Original file name: I haven't thought about that much. If the original path is stored, then later changes like the download either passing file name or a new one generated by "title - author" (we'd need to validate valid file names), could they be optional and dealt within later if anyone decides they want that. So far, I'm not worried about having original file name exposed since the path isn't known to users.

sinergatis commented 8 years ago

original_path in the UI isn't really required but perhaps useful. I added it in the book_detail template, shown for admins, mainly for debugging.

Ah, didn't realize it was only shown if user was admin! I see the point, and yeah, probably not harmful.

Symbolic link is perhaps preferable to not have, and hopefully that doesn't make it much more complicated. I was thinking if FileField was an optional, and then testing if it's null, use original_path, does that make it simpler?

Hmmm, I would argue that having a symbolic link actually makes things simpler, as Django would for all the purposes treat the symbolic link as any other file (except for when adding via the command line), and in turn all features that deal with files (downloading, editing in the admin, getting filenames, etc) would require no changes.

Could you please elaborate on why you think it is preferable not to have symbolic links?

Original file name: I haven't thought about that much

Me neither :tropical_drink: ! As mentioned, we can deal with that in a new, low priority issue some time in the future, so let's skip that point and focus on the others for now.

aristippe commented 8 years ago

… a symbolic link actually makes things simpler, as Django would for all the purposes treat the symbolic link as any other file (except for when adding via the command line), and in turn all features that deal with files (downloading, editing in the admin, getting filenames, etc) would require no changes.

Initially it seemed cleaner to me, as it would make some aspects of management easier. If someone moved the server, or changed folder locations, for instance merging folders or adding a drive, there'd be one less thing to change, if much later there's some superuser/admin abilities to change path. Though in that case, the symlink seems like it could be redone during edit. How often I or someone else might do that or something else, who knows, maybe not often. It just seemed nicer to not have the symlink there if it isn't necessary.

If it were the case that importing thru web fills out FileField and leaves original_path null, while via the command line w/parameter does vice versa, and it's a few tests here and there to pass whichever field isn't null, then it'd be preferable. if it's some subclassing and it gets to be a fair amount of effort, go with the symlinks, and I can figure the rest out someday whenever I really get the urge. The basic ability would be good enough and wonderful!

sinergatis commented 8 years ago

I see your point, and while I agree that it might probably place some extra burden on the "server admin", I'm feeling that maybe we are over-thinking it and it would be a rather edge case! One could argue that if the server admin sees the need for making changes to the original folders he can use the standard "just copy the files" approach and keep both sets of files separated, and if for any reason (extremely large library size, etc) needs to keep the original folder in sync with the Django DB he will basically assume that some batch manual editing be needed anyway (to the DB if we don't use symbolic links, and to the symlinks if we do).

In regards to the implementation, I still feel that making sure that we always have a FileField has many bonuses and is probably the cleanest way to go - instead of having to make an explicit if on every function that deals with the epub file (granted that they are not that many at the moment), we just keep it nicely isolated on a subclass/whatever, rely on the Django goodies, and forget about it entirely! I feel it basically gives us quite a bunch of peace of mind of sorts, preventing mistakes and making looking at the code easier.

Here is one idea: since we have duplicate checking via SHA256 in place, and the addepub command working, introducing some kind of resync command might help a lot in regards to "server folder editing" and make the symlink vs. no-symlink issue less relevant: it would basically work the same as addepub, traversing some folder(s), parsing files, and if a duplicate is found on the DB just update the relevant file/path for the Book.

In short, what if I try to come up with a temptative symbolic-link based solution, along with the resync command, and if we see that it causes trouble we can revert or improve in future iterations? As you mention, basically what matters is having the basic ability!

aristippe commented 8 years ago

Indeed modifying the DB would be needed if folder locations change. It could be needed a bit more often, let's say if I moved my Calibre library. Perhaps a rare thing since I've only done that once but who knows. Moving folders, and reorganizing a collection by subject may happen. For those that use it as some type of public server, that might take place less often. With me, I don't know yet. Without the symlinks, that'd save an extra step.

Agreed the FileField should be there. Do you see anything wrong with it possibly being null? As far as I can tell, abstracting the file to a function, e.g. get_epub_file, seems safer and may be better anyway since field names might change and may be a bit easier to refactor. If it's that simple, let's go with that, otherwise, sure, it can be looked at again later, and having the basic ability would be all I need. :)

The idea of addepub updating the file sounds good too. I was thinking that might useful someday as well! Overall, whatever looks like the most elegant and simple method to maintain, and simple to code for now is all good.

Thanks again for the effort! I was a bit lost at first since I was going thru the subclass method which I couldn't figure out. Once the basics are in place, it'll be easier for me to look at, hack around, and help out. Whatever you choose, I'm sure it'll be great.

sinergatis commented 8 years ago

Do you see anything wrong with it possibly being null? As far as I can tell, abstracting the file to a function, e.g. get_epub_file, seems safer and may be better anyway since field names might change and may be a bit easier to refactor.

It's quite common to have nullable FieldFiles (for example, an optional profile picture for a user), there is nothing against that! One of their main niceness is that they actually do provide that abstraction layer you mention - you just fetch the file by the field name, or the filename by field.path, etc. And in this particular application I feel it is a good idea to enforce that "one Book<->one file" (at least until we revise the multiple formats, etc stuff) at that level - it seems the logical choice to be, and again, it will probably save us a lot of complexity down the road. We can play around with a testing branch if needed trying your approach in any case!

Thanks again for the effort! I was a bit lost at first since I was going thru the subclass method which I couldn't figure out. Once the basics are in place, it'll be easier for me to look at, hack around, and help out.

No problem, and you were on the right track with the subclassing! I'm shaping up the solution but it is quite related to the info you previously found. And those are mechanisms rather deep inside Django, and definitively not something the regular newish-django-developer usually starts with: I'm actually quite impressed with how quickly you caught up!

aristippe commented 8 years ago

The PR looks fantastic. That you were able to take care of it so quickly is wonderful!

you were on the right track with the subclassing! I'm shaping up the solution but it is quite related to the info you previously found. And those are mechanisms rather deep inside Django, and definitively not something the regular newish-django-developer usually starts with: I'm actually quite impressed with how quickly you caught up!

That was pretty much the first thing I looked into it and seeing how others approached it, the idea of subclassing to modify default behavior made sense is one thing, but knowing how to do it is certainly another! :)

sinergatis commented 8 years ago

With the latest pull requests, hopefully we can consider this issue close - please feel free to do so after testing it a bit to make sure it matches your needs, and close it if that is the case, or suggest more modifications if not! Me likes closed issues :closed_lock_with_key:

aristippe commented 8 years ago

Yep, everything looks great and works perfectly. Thank you!

aristippe / pathagar

Allow import in place from command-line to use existing files #1