datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

crawling sample stanford dataset failed - they have incomplete .tar #18

Open yarikoptic opened 5 years ago

yarikoptic commented 5 years ago

@vsoch, a follow-up to https://github.com/datalad/datalad/issues/2814#issuecomment-420660350, where I wanted to demonstrate the power of the crawler. It failed, and for a reason:

$> datalad crawl-init --save --template=stanford_lib doc_id=jn023kf3320
$> datalad crawl  
...
[INFO   ] Repository found dirty -- adding and committing
[INFO   ] Checking out master into a new branch incoming-processed
[INFO   ] Initiating 1 merge of incoming using strategy theirs
[INFO   ] Adding content of the archive ./Plate_pics_summer_2013_by_plate.tar into annex <AnnexRepo path=/mnt/btrfs/datasets/datalad/crawl-misc/stanford/2012-Mark-Plates (<class 'datalad.support.annexrepo.AnnexRepo'>)>
[INFO   ] Finished adding ./Plate_pics_summer_2013_by_plate.tar: Files processed: 2018, renamed: 2018, +git: 1, +annex: 2017
[INFO   ] Adding content of the archive ./Plate_pics_summer_2014_by_plate.tar into annex <AnnexRepo path=/mnt/btrfs/datasets/datalad/crawl-misc/stanford/2012-Mark-Plates (<class 'datalad.support.annexrepo.AnnexRepo'>)>
[ERROR  ] Command `['/bin/tar', '--extract', '--verbose', '--file', '/mnt/btrfs/datasets/datalad/crawl-misc/stanford/2012-Mark-Plates/.git/annex/objects/mm/vF/MD5E-s7076741120--a9593225a40a297ec98f65e3fab604b2.tar/MD5E-s7076741120--a9593225a40a297ec98f65e3fab604b2.tar', '--directory', '/mnt/btrfs/datasets/datalad/crawl-misc/stanford/2012-Mark-Plates/.git/datalad/tmp/archives/d601a8f512']' returned non-zero exit status 2 [util.py:run_checked:227] (PatoolError)
datalad crawl  3254.09s user 449.61s system 12% cpu 8:31:45.29 total
neurodebian has logged on pts/3 from 73.114.21.37.
...

and indeed the tar file is broken:

$> tar -tvf ./Plate_pics_summer_2014_by_plate.tar
...
-rw-r--r-- mgolson/staff 4100484 2017-01-07 05:05 Plate_pics_summer_2014_by_plate/Plate_077/20140616_plate_077.jpg
-rw-r--r-- mgolson/staff 3869448 2017-01-07 05:05 Plate_pics_summer_2014_by_plate/Plate_077/20140715_plate_077.jpg
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
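
For anyone who wants to repeat this check without shelling out to `tar`, here is a minimal sketch using only Python's standard `tarfile` module. The function name `first_tar_error` is made up for illustration and is not part of DataLad or patool. It assumes an uncompressed tar and only walks the member headers, roughly like `tar -t`, so a cut that happens to fall exactly on a member boundary may go unnoticed.

```python
import tarfile


def first_tar_error(path):
    """Walk the member headers of an uncompressed tar without extracting.

    Returns None if the walk finishes cleanly, otherwise the message of
    the first error raised by the tarfile module.
    """
    try:
        # mode="r:" means "plain tar, no compression"
        with tarfile.open(path, mode="r:") as archive:
            for _member in archive:
                pass  # headers only; file contents are skipped over
    except (tarfile.ReadError, EOFError) as exc:
        return str(exc) or exc.__class__.__name__
    return None
```

Calling `first_tar_error("./Plate_pics_summer_2014_by_plate.tar")` on a truncated download should return a non-`None` message, while a complete archive should give `None`.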

Either we haven't downloaded it in full, or it is broken on the server and no one detected it because no one actually looked into it. So let's check the size:

$> git annex whereis ./Plate_pics_summer_2014_by_plate.tar
whereis Plate_pics_summer_2014_by_plate.tar (2 copies) 
        00000000-0000-0000-0000-000000000001 -- web
        e25a3eae-b7d9-4e4a-bec4-76dc67724d52 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/stanford/2012-Mark-Plates [here]

  web: https://stacks.stanford.edu/file/druid:jn023kf3320/Plate_pics_summer_2014_by_plate.tar
ok

$> ls -ld ./Plate_pics_summer_2014_by_plate.tar                                                                
lrwxrwxrwx 1 yoh datalad 134 Sep 12 18:14 ./Plate_pics_summer_2014_by_plate.tar -> .git/annex/objects/mm/vF/MD5E-s7076741120--a9593225a40a297ec98f65e3fab604b2.tar/MD5E-s7076741120--a9593225a40a297ec98f65e3fab604b2.tar

so the annex key says the size is 7076741120 bytes, and the server reports:

$> wget -S https://stacks.stanford.edu/file/druid:jn023kf3320/Plate_pics_summer_2014_by_plate.tar 2>&1 | grep Length 
  Content-Length: 7076741120
Length: 7076741120 (6.6G) [application/octet-stream]

So we did get it in full, and it is 99% likely that the file is simply incomplete on the server. So, @vsoch, if you are looking to make friends with admins, please channel this report to the SDR folks ;-)
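
The size comparison above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (`annex_key_size`, `remote_content_length`, `size_matches`) are hypothetical and not part of git-annex or DataLad, and the parsing assumes the key carries its size in an `-s<bytes>` field before the `--` separator, as the key in the symlink target above does.

```python
import os
import urllib.request


def annex_key_size(key):
    """Return the size in bytes recorded in an annex key, or None if absent."""
    fields, separator, _name = key.partition("--")
    if not separator:
        raise ValueError(f"not an annex key: {key!r}")
    # fields looks like "MD5E-s7076741120"; skip the backend name
    for field in fields.split("-")[1:]:
        if field.startswith("s") and field[1:].isdigit():
            return int(field[1:])
    return None


def remote_content_length(url):
    """Ask the server for Content-Length via a HEAD request (needs network)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


def size_matches(symlink_target, reported_length):
    """Compare the size in the key (last path component) with a given length."""
    size = annex_key_size(os.path.basename(symlink_target))
    return size is not None and size == reported_length
```

Passing the symlink target from `ls -ld` and the `Content-Length` value from the server to `size_matches` reproduces the manual comparison. A match only says the download is as long as what the server offers; it says nothing about whether the file on the server is itself intact.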

yarikoptic-gitmate commented 5 years ago

GitMate.io thinks a possibly related issue is https://github.com/datalad/datalad-crawler/issues/12 (Running crawl-init in a non-dataset brings confusing and irrlevant error message).

vsoch commented 5 years ago

Will do! Hey @cmh2166 @hannahfrost ping! Dataset is down! I repeat, dataset is down! (see above)

yarikoptic commented 5 years ago

@vsoch is there some official portal to submit a "bug report" against SDR?

vsoch commented 5 years ago

What is SDR?

vsoch commented 5 years ago

ohh the library? How have you interacted with them before?

yarikoptic commented 5 years ago

With Hannah Frost, and right now Amy Hodge is sitting across the table from me (I am at Stanford ATM)

vsoch commented 5 years ago

If you can't talk to them directly or submit an issue on GitHub --> https://github.com/sul-dlss then Stanford has its "HelpSU" system for submitting tickets --> https://helpsu.stanford.edu/helpsu/3.0/helpsu but it would probably just get to their group at the end of the day (and possibly get lost, lol). I'm not sure if you need a sunetid too.

yarikoptic commented 5 years ago

Amy said to use the Feedback dialog on that dataset page... Will do once I remember which one ;-)