Determine how to provide full data

shukryzablah commented 4 years ago

The etl framework does seem like a good idea.

However, I think the first step is to have the files online and a function in the package that downloads them to a directory. This could become our etl_extract in the long run, and it is useful on its own too.
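As a rough illustration of the download step described above, here is a minimal shell sketch. It is not the package's R function: the helper name `fetch_files` and its argument order are made up for this example, and it assumes `curl` is installed.

```shell
# Hypothetical helper (not part of the package): download named files
# from a base URL into a local directory.
# Usage: fetch_files BASE_URL DEST_DIR FILE...
fetch_files() {
  base_url="${1%/}"   # drop one trailing slash, if present
  dest_dir="$2"
  shift 2
  mkdir -p "$dest_dir" || return 1
  for name in "$@"; do
    # -f makes curl exit non-zero on HTTP errors such as 403 or 404
    curl -fsSL -o "$dest_dir/$name" "$base_url/$name" || {
      echo "failed to download $name" >&2
      return 1
    }
  done
}
```

A call would look like `fetch_files BASE_URL data some-file.csv.gz`, where `BASE_URL` is wherever the files end up being hosted and `some-file.csv.gz` is a placeholder, since the actual file names are not listed in this thread.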

The data should be public, and the Box folder is not a friendly place to pull data from. Should we host it ourselves?

nicholasjhorton commented 4 years ago

I've asked IT about how best to make the data available (as I agree that the current setup is insufficient).

nicholasjhorton commented 4 years ago

The full data will be accessible via nhorton.people.amherst.edu/ValleyBikes

Should I just add in the files from Box?

shukryzablah commented 4 years ago

Yes, but it would be better to host compressed versions of all the files.

shukryzablah commented 4 years ago

On my computer I downloaded the ~300MB zip file from Box and ran:

unzip ValleyBike.zip
cd ValleyBike
gzip *

Can you then move the whole folder to your server?

nicholasjhorton commented 4 years ago

I will make the files available at https://nhorton.people.amherst.edu/valleybikes

nicholasjhorton commented 4 years ago

Done.

But I now wonder if we shouldn't just include all of the compressed daily files in extdata? Is that what you are thinking?

shukryzablah commented 4 years ago

I was thinking that could bloat the package. All the compressed files amount to 333MB, with the largest being 4MB and the median being <1MB. I think it would be unnecessary coupling of data and package. By providing the files online we can add more data without updating the package, and we can share the link with people who want to bypass R.

Including the files in extdata would be slightly simpler to implement, but the interface of the package would still be similar.
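For anyone who wants to recompute size figures like the ones quoted above, the following is a small sketch. The function name `summarize_sizes` is hypothetical, and it assumes the compressed files sit directly in one directory with a `.gz` suffix. Sizes are reported in bytes.

```shell
# Hypothetical helper: report count, total, largest and median size
# (in bytes) of the *.gz files directly inside a directory.
summarize_sizes() {
  for f in "$1"/*.gz; do
    # skip the unexpanded pattern when no file matches
    [ -f "$f" ] && wc -c < "$f"
  done | sort -n | awk '
    { size[NR] = $1; total += $1 }
    END {
      if (NR == 0) { print "no files"; exit }
      # input is sorted, so the median is the middle value,
      # or the mean of the two middle values for an even count
      median = (NR % 2) ? size[(NR + 1) / 2] : (size[NR / 2] + size[NR / 2 + 1]) / 2
      printf "files=%d total=%d largest=%d median=%s\n", NR, total, size[NR], median
    }'
}
```

Running `summarize_sizes ValleyBike` after the `gzip *` step would print one summary line for that folder.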

nicholasjhorton commented 4 years ago

Great. Let's stick with the plan to have the files online and not bloat the package.

shukryzablah commented 4 years ago

Trying to download a file from https://nhorton.people.amherst.edu/valleybikes/ gives a 403 (Forbidden) error, both in the browser and through R.

nicholasjhorton commented 4 years ago

Apologies. I had set the wrong umask. I've changed the protections and this should now be working.
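For reference, a permissions fix of this kind can be sketched as follows. This is only an assumption about what such a change might involve, not a record of what was run on the server. The function name `make_web_readable` is hypothetical. With a umask of 022, newly created files come out as 644 and directories as 755, which is what the sketch applies to an existing tree.

```shell
# Hypothetical helper: make an existing directory tree readable by
# everyone, as a web server typically needs in order to serve it.
make_web_readable() {
  # directories need read and execute (search) permission: rwxr-xr-x
  find "$1" -type d -exec chmod 755 {} +
  # regular files only need read permission: rw-r--r--
  find "$1" -type f -exec chmod 644 {} +
}
```

Files that are readable only by their owner (for example mode 600, as a umask of 077 would produce) are a common cause of a 403 response from a web server.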

Amherst-Statistics / ValleyBikes-obsolete

Determine how to provide full data #11