Closed ulikoehler closed 11 years ago
Firstly, sorry for the delay in responding to this request - I was away and then catching up with other projects.
In its current form, this would result in nested directories for identifiers that contain slashes. Is that what was intended?
My instinctive preference would be to escape the slashes under such circumstances, but maybe you can persuade me that creating the sub-directories would be the best option for the majority of use cases? I'd be interested to hear the opinions/votes of any others watching this project...
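For illustration, escaping could look roughly like the sketch below. This is not what oai-harvest does; the helper name escape_identifier is hypothetical, and percent-encoding is only one possible escaping scheme (colons are left untouched here as an assumption).

```python
from urllib.parse import quote, unquote


def escape_identifier(identifier):
    """Percent-encode "/" (and other unsafe characters) in an OAI identifier.

    Hypothetical helper: ":" is kept as-is so that the usual
    oai:host:id shape stays readable in the resulting filename.
    """
    return quote(identifier, safe=":")


escaped = escape_identifier("oai:arXiv.org:adap-org/9807003")
print(escaped)           # oai:arXiv.org:adap-org%2F9807003
print(unquote(escaped))  # oai:arXiv.org:adap-org/9807003
```

The encoding is reversible, so the original identifier can still be recovered from the filename.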
Cheers, John
Hi John! Thanks for your reply & sorry for my late answer!
I think your suggestion is reasonable and should work, but filenames might be nicer when creating subdirectories. Maybe an option is the way to go here?
Let me tell you about my use case (where I encountered this bug): I tried to download the Computer Science arXiv OAI-PMH data (http://arxiv.org/help/oa/index):
oai-harvest -p arXivRaw -s cs -d cs http://export.arxiv.org/oai2
After ~51k successful downloads this error occurs:
DEBUG Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2886.arXivRaw.xml
DEBUG Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2916.arXivRaw.xml
DEBUG Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2923.arXivRaw.xml
DEBUG Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2928.arXivRaw.xml
DEBUG Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2931.arXivRaw.xml
DEBUG Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:adap-org/9807003.arXivRaw.xml
ERROR [Errno 2] No such file or directory: '/home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:adap-org/9807003.arXivRaw.xml'
Obviously the directory isn't created automatically, which is what I intended to fix with this pull request. There are two roughly equivalent possibilities:
I think there are some pretty weak advantages when using the second option (or providing an option to do so):
I'm not yet very familiar with the OAI-PMH protocol and format, so please feel free to correct me if any of the information above is incomplete or incorrect!
Thanks again & best regards, Uli
I can see the argument for sub-directories, but at the same time I'm cautious about assuming that any / character in an identifier was intended to represent a sub-directory.
I think that you're right about making this an option. I will look at incorporating this in the next few days.
Items in sub-directories should not contain different formats - these should be represented by different metadata formats in OAI-PMH, that can be retrieved by specifying the appropriate metadataPrefix value.
All the best, John
Sorry if I'm late and the ticket is closed. I think the --create-subdir feature is useful when harvesting multiple repositories, but I would like to be able to define the separator with another command flag.
Note that EPrints repositories, for example, publish identifiers in the form oai:baseurl:eprintID, which yields filenames such as oai:amsacta.cib.unibo.it:3828.oai_dc.xml
So in this case a ":" separator for subdirectories would be better.
@atomotic I would intuitively start multiple harvester processes when harvesting multiple repos. Is there any specific reason why you want to harvest everything in one process / harvester call? With arXiv, I have a similar use case, so it would be useful to know whether it has any major advantages.
Thanks a lot @bloomonkey for your support and for implementing it! Of course you're right about the metadata format.
I guess I could make the option accept a value on which to split identifiers into sub-directories, i.e. this value would be replaced with the OS path separator immediately before creating the target sub-directories. I probably don't have time to add this today, but let me know what you think. In fact, as this pull request has now closed, maybe @atomotic could start a new one?
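A rough sketch of that idea, purely as an assumption about how it might look: the function identifier_to_path and its separator parameter are hypothetical and not taken from the oai-harvest code base.

```python
import os


def identifier_to_path(base_dir, identifier, suffix, separator=None):
    """Build the target file path for a record (hypothetical helper).

    If a separator is given, every occurrence in the identifier is
    replaced with the OS path separator, so each part of the
    identifier becomes one directory level below base_dir.
    """
    if separator:
        identifier = identifier.replace(separator, os.path.sep)
    return os.path.join(base_dir, identifier + suffix)


# EPrints-style identifier, split on ":".
print(identifier_to_path("out", "oai:amsacta.cib.unibo.it:3828", ".oai_dc.xml", ":"))
```

With no separator the identifier is used unchanged, so an identifier that already contains "/" would still end up in a sub-directory on POSIX systems unless it is escaped first.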
Thanks both @atomotic and @ulikoehler for feedback and contributions.
I tried to harvest the arXiv metadata, but it failed because it was trying to open a file in a subdirectory that did not exist.
This changes the directory creation behaviour to mkdir -p all dirs in os.path.dirname(fp).
Additionally, it partially works around the problem that the user might pass an existing file as the -d argument, or that a target dir might already exist as a file (os.path.isdir() is now used instead of os.path.exists()).
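The behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the actual diff; the helper name ensure_parent_dir is hypothetical, while fp follows the naming used in the description.

```python
import os


def ensure_parent_dir(fp):
    """Create all missing directories above fp, like `mkdir -p`.

    os.path.isdir() is used rather than os.path.exists(), so a plain
    file sitting where a directory is expected is not mistaken for a
    usable directory; in that case os.makedirs() raises an OSError
    instead of the write failing later with a confusing message.
    """
    parent = os.path.dirname(fp)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    return parent


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "cs", "adap-org", "9807003.arXivRaw.xml")
        ensure_parent_dir(target)
        with open(target, "w") as fh:
            fh.write("<record/>")
        print(os.path.isfile(target))
```

This is only a partial workaround: if a path component already exists as a file, the harvest still fails, just earlier and with a clearer error.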