amchagas / open-hardware-supply

having a closer look at how OSH papers are evolving over time
MIT License

next steps (10/04/2024) #19

Open amchagas opened 2 months ago

amchagas commented 2 months ago

Did a bit of research on data storage. A good combo seems to be GIN (https://gin.g-node.org/) and DataLad, as they both use git-annex and are open source. GIN has unlimited space, allows both private repos (if we want to have a PDF repository, for instance) and public ones, and is hosted in Germany.

To be fair, there are a large number of data storage solutions out there (OSF, figshare, etc.). Some are open source, have version control, and treat data as living, evolving "objects". I couldn't immediately see big differences between some of them for our use case (some have a 50 GB limit per project, for instance), but the combo mentioned above seems to be a good one. Happy to start setting things up or to discuss further, as needed.

solstag commented 2 months ago

Can you remind me what the problem with requests vs. curl is, and where it occurs?

amchagas commented 2 months ago

From what I could gather, requests throws a timeout error when it hits a website that has a cookie policy pop-up (one of those that require users to agree to, select, or deselect which cookies are allowed). Using curl directly on the command line does not seem to have that problem. So an easy test would be to put code in place that calls curl from within Python.
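A minimal sketch of what calling curl from within Python could look like, using only the standard library. The function names, the 30-second default timeout, and the choice of curl flags are assumptions for illustration, not code from the repository, and whether this actually avoids the timeout on cookie pop-up sites is still to be tested.

```python
import shutil
import subprocess


def build_curl_command(url, timeout_s=30):
    """Build the curl argument list (hypothetical helper, not from the repo)."""
    return [
        "curl",
        "--silent",        # no progress meter
        "--show-error",    # but still report errors on stderr
        "--fail",          # non-zero exit code on HTTP errors (4xx/5xx)
        "--location",      # follow redirects
        "--max-time", str(timeout_s),  # overall time limit in seconds
        url,
    ]


def fetch_with_curl(url, timeout_s=30):
    """Fetch a URL by shelling out to curl and return the response body as text."""
    if shutil.which("curl") is None:
        raise RuntimeError("curl was not found on PATH")
    result = subprocess.run(
        build_curl_command(url, timeout_s),
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"curl failed with exit code {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with URLs that contain `&` or `?`.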

solstag commented 2 months ago

Are these lines all that is concerned or is there something somewhere else?

https://github.com/amchagas/open-hardware-supply/blob/main/code/method2-scholarly/4_grab_zotero_metadata.py#L112-L113

By the way, why do you do a get() and then replace the result with a post()?

amchagas commented 2 months ago

Actually, the commented-out line below (115) is the correct one. The other two are just me not being careful enough to clean up my code before pushing to remote; they are left over from me testing things manually.

solstag commented 2 months ago

With afa6c32 and 5bb1867 the analyses are much better structured and can now be easily tested on any dataset.

solstag commented 3 weeks ago

Regarding data storage, I've turned our repos into a regular git code repo and a datalad git-annex repo. I'll soon push the results, but instead of hosting data on github with LFS (which seems to be quirky) we'll be hosting it on GIN, which seems to be well integrated with datalad.

solstag commented 3 weeks ago

Looks promising... : )

https://gin.g-node.org/solstag/open-hardware-supply-data

In theory you just have to run `datalad clone -d . git@gin.g-node.org:/solstag/open-hardware-supply-data.git data` from the root of the code repo, and we're back to where we started. Only the figures will have moved to data/figures, so some code changes are due. I'll look into that later, before pushing the code repo. :)
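For anyone who prefers to script the clone step above, here is a rough sketch that wraps the same command with the standard library. The helper names and the `.git` check are assumptions added for illustration; it presumes `datalad` is installed and that you have SSH access to GIN.

```python
import subprocess
from pathlib import Path

# Source URL as given in this thread.
DATA_URL = "git@gin.g-node.org:/solstag/open-hardware-supply-data.git"


def build_clone_command(source=DATA_URL, target="data"):
    """Return the same command as in the comment above, as an argument list."""
    return ["datalad", "clone", "-d", ".", source, target]


def clone_data(code_repo_root, source=DATA_URL, target="data"):
    """Run the clone from the root of the code repo (hypothetical helper)."""
    root = Path(code_repo_root)
    # Sanity check (an assumption): the command is meant to run from a git repo root.
    if not (root / ".git").exists():
        raise FileNotFoundError(f"{root} does not look like the root of a git repo")
    subprocess.run(build_clone_command(source, target), cwd=root, check=True)
    return root / target
```

Calling `clone_data(".")` from the code repo root should then leave the data under `data/`, with figures under `data/figures`.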

This datalad cheat sheet is helpful.

solstag commented 3 weeks ago

Ok, just noticed the output directory was already parametrized, so it's actually good to go. I've just pushed the changes :D