datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

Pipelines are broken due to merge conflict with .gitattributes #5

Closed mih closed 6 years ago

mih commented 6 years ago

@yarikoptic As discovered by https://github.com/datalad/datalad/pull/2487, the PR https://github.com/datalad/datalad/pull/1597 broke the OpenFMRI pipeline.

Full log: https://travis-ci.org/datalad/datalad/jobs/378021286

And in fact more than just the OpenFMRI pipeline is broken now:

https://travis-ci.org/datalad/datalad-crawler/jobs/378037035

It seems the pipelines (or possibly just their tests) cannot handle the situation where a dataset already comes with a .gitattributes file.

======================================================================
ERROR: datalad_crawler.pipelines.tests.test_openfmri.test_openfmri_pipeline2
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/mih/hacking/datalad/git/datalad/tests/utils.py", line 493, in newfunc
    return t(*(arg + (d,)), **kw)
  File "/home/mih/hacking/datalad/git/datalad/tests/utils.py", line 604, in newfunc
    return tfunc(*(args + (path, url)), **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/tests/utils.py", line 663, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipelines/tests/test_openfmri.py", line 514, in test_openfmri_pipeline2
    out = run_pipeline(pipeline)
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 114, in run_pipeline
    output = list(xrun_pipeline(*args, **kwargs))
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 194, in xrun_pipeline
    for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)):
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 270, in xrun_pipeline_steps
    for data_ in data_in_to_loop:
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 194, in xrun_pipeline
    for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)):
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 286, in xrun_pipeline_steps
    for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output):
  [Previous line repeated 2 more times]
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/pipeline.py", line 270, in xrun_pipeline_steps
    for data_ in data_in_to_loop:
  File "/home/mih/hacking/datalad/crawler/datalad_crawler/nodes/annex.py", line 828, in merge_branch
    self.repo.merge(to_merge, options=options, **merge_kwargs)
  File "/home/mih/hacking/datalad/git/datalad/support/gitrepo.py", line 1951, in merge
    **kwargs
  File "/home/mih/hacking/datalad/git/datalad/support/annexrepo.py", line 1207, in _git_custom_command
    return super(AnnexRepo, self)._git_custom_command(*args, **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/support/gitrepo.py", line 301, in newfunc
    result = func(self, files_new, *args, **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/support/gitrepo.py", line 1576, in _git_custom_command
    expect_fail=expect_fail)
  File "/home/mih/hacking/datalad/git/datalad/cmd.py", line 668, in run
    cmd, env=self.get_git_environ_adjusted(env), *args, **kwargs)
  File "/home/mih/hacking/datalad/git/datalad/cmd.py", line 528, in run
    raise CommandError(str(cmd), msg, status, out[0], out[1])
datalad.support.exceptions.CommandError: CommandError: command '['git', '-c', 'receive.autogc=0', '-c', 'gc.auto=0', 'merge', '--allow-unrelated-histories', 'incoming-processed']' failed with exitcode 1
Failed to run ['git', '-c', 'receive.autogc=0', '-c', 'gc.auto=0', 'merge', '--allow-unrelated-histories', 'incoming-processed'] under '/tmp/datalad_temp_test_openfmri_pipeline28q2wfff1'. Exit code=1. out=Auto-merging .gitattributes
CONFLICT (add/add): Merge conflict in .gitattributes
Automatic merge failed; fix conflicts and then commit the result.
 err=
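The add/add conflict in the log can be reproduced in a minimal throwaway repository (hypothetical contents and commit messages, not the crawler's actual layout): master and an unrelated incoming-processed branch each commit their own .gitattributes, and merging them conflicts on that file, just as in the failing test.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # pin branch name across git versions
git config user.email demo@example.com
git config user.name demo
# master gets its own .gitattributes
echo '* annex.largefiles=(largerthan=0)' > .gitattributes
git add .gitattributes
git commit -qm 'master .gitattributes'
# an orphan branch (unrelated history) adds a different .gitattributes
git checkout -q --orphan incoming-processed
git rm -q --cached .gitattributes
echo '* annex.backend=MD5E' > .gitattributes
git add .gitattributes
git commit -qm 'incoming-processed .gitattributes'
# merging the unrelated histories conflicts on .gitattributes (add/add)
git checkout -q master
git merge --allow-unrelated-histories incoming-processed \
    || echo 'add/add conflict reproduced'
```

The merge leaves conflict markers in .gitattributes, matching the "CONFLICT (add/add): Merge conflict in .gitattributes" line from the Travis log.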

Also FYI @bpoldrack

yarikoptic commented 6 years ago

Looking into it. I think I will just make the crawler default to starting the new branches (incoming, incoming-processed) from the master branch. That might have undesired side effects (e.g. if someone starts crawling late in master's history, those branches would inherit everything master has instead of starting clean), but I do not see a better way around it: we do need to set .gitattributes for the MD5 backend while working in those branches, and that is exactly where the conflict emerges. incoming has .gitattributes with the backend set, while master has not only that (which was fine while the files were identical) but also the settings for what gets committed to git vs. annex.
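The proposed default can be sketched as follows (assumed semantics, with hypothetical .gitattributes contents): by branching incoming-processed from master instead of creating it as an orphan, the branch inherits master's .gitattributes, so later merges are between related histories and cannot hit an add/add conflict on that file.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # pin branch name across git versions
git config user.email demo@example.com
git config user.name demo
# master already ships a .gitattributes
echo '* annex.largefiles=(largerthan=0)' > .gitattributes
git add .gitattributes
git commit -qm 'master .gitattributes'
# proposed default: branch from master, not an orphan branch
git checkout -q -b incoming-processed master
echo '* annex.backend=MD5E' >> .gitattributes  # extend the inherited file
git add .gitattributes
git commit -qm 'set MD5E backend for crawled content'
# the merge back into master is now an ordinary (here fast-forward) merge
git checkout -q master
git merge -q --no-edit incoming-processed
echo 'merged cleanly'
```

After the merge, master's .gitattributes contains both the original settings and the backend line, with no conflict.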

mih commented 6 years ago

Sounds sane to me. I don't see this "late crawler use" being a common pattern.