Create target subdirs if they don't exist

ulikoehler commented 11 years ago

I tried to harvest the arXiv metadata, but it failed because it was trying to open a file in a subdirectory that did not exist.

This changes the directory creation behaviour to mkdir -p all dirs in os.path.dirname(fp).

Additionally, it partially works around the problem that the user might pass an existing file as the -d argument, or that a target dir might already exist as a file: os.path.isdir() is now used instead of os.path.exists().
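
A minimal sketch of the behaviour described above, not the actual patch: ensure_parent_dir is a hypothetical helper name, and fp stands for the path of the file about to be written. It assumes Python 3 (NotADirectoryError does not exist in Python 2).

```python
import os


def ensure_parent_dir(fp):
    """Create every missing directory in os.path.dirname(fp), like `mkdir -p`.

    Hypothetical helper for illustration; not taken from oai-harvest itself.
    """
    parent = os.path.dirname(fp)
    if not parent:
        # fp has no directory component, so there is nothing to create.
        return
    if os.path.isdir(parent):
        # The directory is already there.
        return
    if os.path.exists(parent):
        # The path exists but is not a directory (e.g. a regular file).
        # Checking only os.path.exists() would have hidden this case.
        raise NotADirectoryError(parent)
    os.makedirs(parent)
```

Calling ensure_parent_dir(fp) immediately before open(fp, "w") would avoid the "No such file or directory" failure for nested targets, while still failing loudly if the parent path is occupied by a file.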

bloomonkey commented 11 years ago

Firstly, sorry for the delay in responding to this request - I was away and then catching up with other projects.

In its current form, this would result in nested directories for identifiers that contain slashes. Is that what was intended?

My instinctive preference would be to escape the slashes under such circumstances, but maybe you can persuade me that creating the sub-directories would be the best option for the majority of use cases? I'd be interested to hear the opinions/votes of any others watching this project...

Cheers, John

ulikoehler commented 11 years ago

Hi John! Thanks for your reply & sorry for my late answer!

I think your suggestion is reasonable and should work, but the filenames might be nicer when creating subdirs. Maybe an option is the way to go here?

Let me tell you about my use case (where I encountered this bug): I tried to download the Computer Science arXiv OAI-PMH data (http://arxiv.org/help/oa/index):

oai-harvest -p arXivRaw -s cs -d cs http://export.arxiv.org/oai2

After ~51k successful downloads this error occurs:

DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2886.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2916.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2923.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2928.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:1310.2931.arXivRaw.xml
DEBUG    Writing to file /home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:adap-org/9807003.arXivRaw.xml
ERROR    [Errno 2] No such file or directory: '/home/uli/dev/faeb57a5cd4cda03909d/cs/oai:arXiv.org:adap-org/9807003.arXivRaw.xml'

Obviously the directory isn't created automatically, which is what I intended to fix with this pull request. There are three possibilities, the first two roughly equivalent:

1. As suggested by you, write the file to oai:arXiv.org:adap-org-9807003.arXivRaw.xml
2. As implemented by this pull request, mkdir oai:arXiv.org:adap-org and save 9807003.arXivRaw.xml there.
3. Just save as 9807003.arXivRaw.xml and hope it doesn't overwrite something. IMO not really an option.
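
To make the difference between the first two possibilities concrete, here is a rough sketch. Both function names are hypothetical, and the "<identifier>.<metadataPrefix>.xml" naming is only inferred from the log output above; this is not the code in the pull request.

```python
import os


def escaped_path(target_dir, identifier, prefix):
    """Escape slashes so the record stays directly inside target_dir."""
    safe = identifier.replace("/", "-")
    return os.path.join(target_dir, "{0}.{1}.xml".format(safe, prefix))


def nested_path(target_dir, identifier, prefix):
    """Let every slash in the identifier start a new sub-directory."""
    filename = "{0}.{1}.xml".format(identifier, prefix)
    return os.path.join(target_dir, *filename.split("/"))


identifier = "oai:arXiv.org:adap-org/9807003"
print(escaped_path("cs", identifier, "arXivRaw"))
print(nested_path("cs", identifier, "arXivRaw"))
```

One caveat with the escaping variant as sketched: replacing "/" with "-" is not reversible, and two different identifiers could in principle map to the same filename.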

I think there are some (admittedly weak) advantages to using the second option (or to providing an option to do so):

- For some of my use cases, several million records will be downloaded (arXiv is a pretty good example, e.g. when downloading the physics set). Some filesystems perform worse when millions of files are stored in the same directory, and even when they don't, tools like ls or the Python equivalents are usually quite slow. It seems to be a good idea to reduce the number of files in a single directory.
- It can be assumed that the OAI-PMH administrator packed some meaningful information into the ID, so it might be a good idea to keep the filenames consistent. In the case of arXiv, there would be two filename formats: 9807003.arXivRaw.xml and oai:arXiv.org:adap-org-9807003.arXivRaw.xml.
- Items in sub-directories might contain different information formats or types (e.g. additional XML fields), and it would be difficult to separate the records afterwards.

I'm not really familiar with the OAI-PMH protocol and format yet, so please feel free to correct me if some of the information above is incomplete or incorrect!

Thanks again & best regards, Uli

bloomonkey commented 11 years ago

I can see the argument for sub-directories, but at the same time I'm cautious about assuming that any / character in an identifier was intended to represent a sub-directory.

I think that you're right about making this an option. I will look at incorporating this in the next few days.

Items in sub-directories should not contain different formats - these should be represented by different metadata formats in OAI-PMH, that can be retrieved by specifying the appropriate metadataPrefix value.
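
As a rough illustration of that point: in OAI-PMH the format is selected per request, independently of the identifier. The sketch below only builds ListRecords request URLs; list_records_url is a hypothetical helper, while verb, metadataPrefix and set are standard OAI-PMH request arguments, and the base URL and set are the ones from the command earlier in this thread.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


base = "http://export.arxiv.org/oai2"
# Same records, two different metadata formats.
print(list_records_url(base, "arXivRaw", "cs"))
print(list_records_url(base, "oai_dc", "cs"))
```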

All the best, John

atomotic commented 11 years ago

sorry if i'm late and the ticket is closed. i think that --create-subdir feature is useful when harvesting multiple repositories, but i would like to define the separator with another command flag

note that eprints repositories for example publish identifiers in this form: oai:baseurl:eprintID, e.g. oai:amsacta.cib.unibo.it:3828, which gets saved as oai:amsacta.cib.unibo.it:3828.oai_dc.xml

so in this case a ":" separator for subdir would be better

ulikoehler commented 11 years ago

@atomotic I would intuitively start multiple harvester processes when harvesting multiple repos. Is there any specific reason why you want to harvest everything in one process / harvester call? With arXiv, I have a similar use case, so it would be useful to know if it has any major advantages.

Thanks a lot @bloomonkey for your support and for implementing it! Of course you're right about the metadata format.

bloomonkey commented 11 years ago

I guess I could make the option accept a value on which to create sub-directories? i.e. this value would be replaced with the OS path separator immediately before creating the target sub-directories. I probably don't have time to add this today, but let me know what you think. In fact, as this pull request has now been closed, maybe @atomotic could start a new one?
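
A rough sketch of what that could look like, purely as an assumption about the eventual behaviour: subdir_path and its separator parameter are hypothetical names, and the "<identifier>.<metadataPrefix>.xml" naming is inferred from the filenames quoted in this thread.

```python
import os


def subdir_path(target_dir, identifier, prefix, separator=None):
    """Map an identifier to a file path, splitting it on `separator`.

    If a separator is given, each occurrence in the identifier is replaced
    with the OS path separator, so it introduces a sub-directory.
    """
    if separator:
        identifier = identifier.replace(separator, os.sep)
    filename = "{0}.{1}.xml".format(identifier, prefix)
    return os.path.join(target_dir, filename)


# eprints-style identifier split on ":".
print(subdir_path("out", "oai:amsacta.cib.unibo.it:3828", "oai_dc", ":"))
```

Only the identifier is rewritten, so a separator such as "." would not break the ".oai_dc.xml" suffix. Any character other than the chosen separator is left untouched, so a full implementation would still need to decide how to handle a literal "/" in that case.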

Thanks both @atomotic and @ulikoehler for feedback and contributions.

bloomonkey / oai-harvest

Create target subdirs if they don't exist #5