bloomonkey / oai-harvest

Python package for harvesting records from OAI-PMH provider(s).

Create target subdirs if they don't exist #5

Closed ulikoehler closed 11 years ago

ulikoehler commented 11 years ago

I tried to harvest the arXiv metadata, but it failed because it was trying to open a file in a subdirectory that did not exist.

This changes the directory creation behaviour to mkdir -p all dirs in os.path.dirname(fp).

Additionally, it partially works around the problem that the user might pass an existing file as the -d argument, or that a target directory might already exist as a file (os.path.isdir() is now used instead of os.path.exists()).
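For illustration, the directory handling described above might look roughly like the following. This is a minimal sketch, not the code from the pull request; the helper name `ensure_parent_dir` is hypothetical.

```python
import os


def ensure_parent_dir(fp):
    """Create every missing directory in os.path.dirname(fp), like `mkdir -p`.

    Hypothetical helper for illustration; not part of oai-harvest.
    """
    parent = os.path.dirname(fp)
    # isdir() rather than exists(): if the parent exists but is a regular
    # file, we still call makedirs(), which raises OSError instead of the
    # problem going unnoticed until the later open() call.
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    return parent
```

Calling it immediately before opening the target file for writing would avoid the "No such file or directory" error for identifiers that contain a slash.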

bloomonkey commented 11 years ago

Firstly, sorry for the delay in responding to this request - I was away and then catching up with other projects.

In its current form, this would result in nested directories for identifiers that contain slashes. Is that what was intended?

My instinctive preference would be to escape the slashes under such circumstances, but maybe you can persuade me that creating the sub-directories would be the best option for the majority of use cases? I'd be interested to hear the opinions/votes of any others watching this project...

Cheers, John
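As a rough sketch of the escaping alternative, the slash could be percent-encoded before the file name is built. The function name `escaped_filename` and the choice of percent-encoding are assumptions made for illustration; the thread does not say which escaping scheme was meant.

```python
from urllib.parse import quote


def escaped_filename(identifier, metadata_prefix):
    """Build a flat file name by percent-encoding "/" in the identifier.

    Hypothetical helper; the escaping scheme is an assumption.
    """
    # safe=":" keeps the colons of the OAI identifier readable, while "/"
    # is encoded as %2F so that no sub-directory is implied.
    return "{0}.{1}.xml".format(quote(identifier, safe=":"), metadata_prefix)


print(escaped_filename("oai:arXiv.org:adap-org/9807003", "arXivRaw"))
# prints: oai:arXiv.org:adap-org%2F9807003.arXivRaw.xml
```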

ulikoehler commented 11 years ago

Hi John! Thanks for your reply & sorry for my late answer!

I think your suggestion is reasonable and should work, but filenames might be nicer when creating subdirs. Maybe an option is the way to go here?

Let me tell you about my use case (where I encountered this bug): I tried to download the Computer Science arXiv OAI-PMH data (http://arxiv.org/help/oa/index):

oai-harvest -p arXivRaw -s cs -d cs http://export.arxiv.org/oai2

After ~51k successful downloads this error occurs:

DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2886.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2916.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2923.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2928.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2931.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:adap-org/9807003.arXivRaw.xml
ERROR    [Errno 2] No such file or directory: '/home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:adap-org/9807003.arXivRaw.xml'

Obviously the directory isn't created automatically, which is what I intended to fix with this pull request. There are two roughly equivalent possibilities:

I think there are some pretty weak advantages when using the second option (or providing an option to do so):

I'm not really into the OAI-PMH protocol and format yet, so please feel free to correct me if some of the information above is incomplete or incorrect!

Thanks again & best regards, Uli

bloomonkey commented 11 years ago

I can see the argument for sub-directories, but at the same time I'm cautious about assuming that any / character in an identifier was intended to represent a sub-directory.

I think that you're right about making this an option. I will look at incorporating this in the next few days.

Items in sub-directories should not contain different formats. Different formats should be represented by different metadata formats in OAI-PMH, which can be retrieved by specifying the appropriate metadataPrefix value.

All the best, John
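To make the metadataPrefix point concrete, here is a small sketch that builds OAI-PMH GetRecord request URLs for one identifier in two formats. The base URL, the identifier and the arXivRaw prefix are taken from the thread; the helper name `get_record_url` is hypothetical, and oai_dc is used because OAI-PMH requires every repository to support it.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"  # provider URL quoted in the thread


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Return a GetRecord request URL for one record in one metadata format.

    Hypothetical helper; it only builds the URL and performs no request.
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return "{0}?{1}".format(base_url, query)


# Same record, two formats: only the metadataPrefix argument differs.
for prefix in ("oai_dc", "arXivRaw"):
    print(get_record_url("oai:arXiv.org:1310.2886", prefix))
```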

atomotic commented 11 years ago

Sorry if I'm late and the ticket is closed. I think that the --create-subdir feature is useful when harvesting multiple repositories, but I would like to define the separator with another command flag.

Note that EPrints repositories, for example, publish identifiers in the form oai:baseurl:eprintID, which gives file names such as oai:amsacta.cib.unibo.it:3828.oai_dc.xml

So in this case a ":" separator for subdirs would be better.

ulikoehler commented 11 years ago

@atomotic I would intuitively start multiple harvester processes when harvesting multiple repos. Is there any specific reason why you want to harvest everything in one process / harvester call? With arXiv, I have a similar use case, so it would be useful to know whether it has any major advantages.

Thanks a lot @bloomonkey for your support and for implementing it! Of course you're right about the metadata format.

bloomonkey commented 11 years ago

I guess I could make the option accept a value on which to split identifiers into sub-directories, i.e. this value would be replaced with the OS path separator immediately before creating the target sub-directories. I probably don't have time to add this today, but let me know what you think. In fact, as this pull request has now closed, maybe @atomotic could start a new one?

Thanks both @atomotic and @ulikoehler for feedback and contributions.
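The configurable separator proposed above could be sketched as follows. This is only an illustration under stated assumptions: the function `target_path` and its `subdir_sep` parameter are hypothetical names, not the option as implemented in oai-harvest.

```python
import os


def target_path(base_dir, identifier, metadata_prefix, subdir_sep=None):
    """Map an OAI identifier to a file path below base_dir.

    Hypothetical helper. If subdir_sep is given, each occurrence in the
    identifier is replaced with the OS path separator, so the identifier
    is spread over sub-directories.
    """
    if subdir_sep:
        # Replace only within the identifier, so a separator such as "."
        # cannot split the ".<prefix>.xml" suffix into directories.
        identifier = identifier.replace(subdir_sep, os.sep)
    return os.path.join(base_dir, "{0}.{1}.xml".format(identifier, metadata_prefix))


# EPrints-style identifier split on ":".
print(target_path("out", "oai:amsacta.cib.unibo.it:3828", "oai_dc", ":"))
```

On a POSIX system this prints out/oai/amsacta.cib.unibo.it/3828.oai_dc.xml. The parent directories would still need to be created before the file is written.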