#126 fix - dhs double zipping

ajdamico / lodown

locally download and prepare publicly-available microdata

GNU General Public License v3.0

97 stars 47 forks source link

#126 fix - dhs double zipping #127

Closed majazaloznik closed 6 years ago

majazaloznik commented 6 years ago

OK, here it is. This extracts all .zip files if sub zipfiles exist. It also adds a row to the catalog, and even gives a semi-correct message at the end of each file extraction - the only issue there is that you cannot know in advance if you have one of these files in your catalog, so it is impossible to always correctly know how many entries there will be in total until the end.

Other than that it's great. :) Mind you, this still won't work on Windows. I think that could be solved actually just by extracting into a different folder than the download folder. But I can't test it now, but can check back in a few weeks when i get back to my work computer.

And thanks for this opportunity, I learned a tonne! (sorry about the spacing, I have no idea what is going on there!)

ajdamico commented 6 years ago

thanks so much for your thoughtful edits. i need to rewrite so the catalog is read-only..

majazaloznik commented 6 years ago

Sure thing! I still think my first suggestion (the line I wrote in #126) is the better option--only unzip the named file and ignore the other one that is duplicate anyway, also keeps the catalog intact. A lot cleaner and less kludgy.

ajdamico commented 6 years ago

does that behave differently than

https://github.com/ajdamico/lodown/commit/bca20c3087fe3bdd3ea96c8f0f0281eb81365aee

? thanks!

On Thu, Dec 21, 2017 at 9:11 AM, maja notifications@github.com wrote:

Sure thing! I still think my first suggestion (the line I wrote in #126 https://github.com/ajdamico/lodown/issues/126) is the better option--only unzip the named file and ignore the other one that is duplicate anyway, also keeps the catalog intact. A lot cleaner and less kludgy.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/ajdamico/lodown/pull/127#issuecomment-353360179, or mute the thread https://github.com/notifications/unsubscribe-auth/AANO56umFEZnukFgR3bkPgUDinfNAELCks5tCmcJgaJpZM4RJO1C .

majazaloznik commented 6 years ago

yes, (i tested both as well) edit: they weren't clean tests I just realised (i was fiddling with the catalog) but did a clean one now and still no joy.

ajdamico commented 6 years ago

hi, the catalog insertion is the problem -- i'd rather remove the output_filename column and just store everything into the output_directory that's available, and not insert an additional row into the catalog. i'll look closer at this next year

majazaloznik commented 6 years ago

yeah - but think about it this way: why do you need a catalog insertion? because they added an extra file that isn't in the catalog - but that the user can download separately and merge with this one if they want. so i say just ignore the extra file. keep the integrity of the catalog, all the data is there anyway?

ajdamico commented 6 years ago

if you think all multiple-file downloads from dhs are mistakes, i agree you can just add a hardcoded removal! thanks

majazaloznik commented 6 years ago

i am sure they would find the word mistake a bit strong, but it is still inexplicable to me why they did it. the files that i checked perfectly duplicate the data already available in other files. (i can't check them all right now, i'm working off mobile data :)

ajdamico commented 6 years ago

i've made a bunch of changes so closing but can reopen if i'm being stupid. thanks

majazaloznik commented 6 years ago

thanks, i'll have a look asap. hope you had a good break ;)

On 10 January 2018 at 22:18, Anthony Damico notifications@github.com wrote:

i've made a bunch of changes so closing but can reopen if i'm being stupid. thanks

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ajdamico/lodown/pull/127#issuecomment-356739784, or mute the thread https://github.com/notifications/unsubscribe-auth/AHaQrSD7zh-HD5rk88BeajUB_T2uDDj3ks5tJSktgaJpZM4RJO1C .