Closed: gjost closed this 1 year ago
I was able to duplicate this on my dev VM:
(cmdln) ddr@densho101dev:/opt/ddr-cmdln$ ddrimport file /tmp/ddr-densho-1000-511-transcripts.csv /var/www/media/ddr/ddr-densho-1000 --dryrun=1
2022-10-17 14:57:31,314 DEBUG <DDR.identifier.Identifier collection:ddr-densho-1000>
2022-10-17 14:57:31,314 INFO Checking CSV file
2022-10-17 14:57:31,314 INFO 27 rows
2022-10-17 14:57:31,315 DEBUG Guessing model based on 27 rows
2022-10-17 14:57:31,315 DEBUG model: entity
2022-10-17 14:57:31,321 DEBUG Starting new HTTP connection (1): partner.densho.org:80
2022-10-17 14:57:31,447 DEBUG http://partner.densho.org:80 "GET /vocab/api/0.2/index.json HTTP/1.1" 301 None
2022-10-17 14:57:31,449 DEBUG Starting new HTTPS connection (1): partner.densho.org:443
2022-10-17 14:57:31,721 DEBUG https://partner.densho.org:443 "GET /vocab/api/0.2/index.json HTTP/1.1" 200 None
...
2022-10-17 14:57:33,171 INFO Validating headers
2022-10-17 14:57:33,171 ERROR * Missing headers: "status"
2022-10-17 14:57:33,172 ERROR * Missing headers: "title"
2022-10-17 14:57:33,172 ERROR * Missing headers: "description"
2022-10-17 14:57:33,172 ERROR * Missing headers: "creation"
2022-10-17 14:57:33,172 ERROR * Missing headers: "location"
2022-10-17 14:57:33,172 ERROR * Missing headers: "creators"
2022-10-17 14:57:33,172 ERROR * Missing headers: "language"
2022-10-17 14:57:33,172 ERROR * Missing headers: "genre"
2022-10-17 14:57:33,172 ERROR * Missing headers: "format"
2022-10-17 14:57:33,172 ERROR * Missing headers: "extent"
2022-10-17 14:57:33,172 ERROR * Missing headers: "contributor"
2022-10-17 14:57:33,172 ERROR * Missing headers: "alternate_id"
2022-10-17 14:57:33,172 ERROR * Missing headers: "digitize_organization"
2022-10-17 14:57:33,172 ERROR * Missing headers: "digitize_date"
2022-10-17 14:57:33,172 ERROR * Missing headers: "credit"
2022-10-17 14:57:33,172 ERROR * Missing headers: "rights_statement"
2022-10-17 14:57:33,172 ERROR * Missing headers: "topics"
2022-10-17 14:57:33,172 ERROR * Missing headers: "persons"
2022-10-17 14:57:33,172 ERROR * Missing headers: "facility"
2022-10-17 14:57:33,172 ERROR * Missing headers: "chronology"
2022-10-17 14:57:33,172 ERROR * Missing headers: "geography"
2022-10-17 14:57:33,172 ERROR * Missing headers: "parent"
2022-10-17 14:57:33,172 ERROR * Missing headers: "signature_id"
2022-10-17 14:57:33,172 ERROR * Missing headers: "notes"
2022-10-17 14:57:33,172 ERROR * Bad headers: "external"
2022-10-17 14:57:33,173 ERROR * Bad headers: "basename-orig"
2022-10-17 14:57:33,173 ERROR * Bad headers: "role"
2022-10-17 14:57:33,173 ERROR * Bad headers: "label"
2022-10-17 14:57:33,173 ERROR * Bad headers: "mimetype"
2022-10-17 14:57:33,173 ERROR * Bad headers: "external_urls"
2022-10-17 14:57:33,173 ERROR * Bad headers: "links"
2022-10-17 14:57:33,173 ERROR * Bad headers: "tech_notes"
2022-10-17 14:57:33,173 ERROR headers FAIL
2022-10-17 14:57:33,173 INFO Validating rows
2022-10-17 14:57:33,183 ERROR * Duplicate IDs: "row 1: ddr-densho-1000-511"
2022-10-17 14:57:33,183 ERROR * Missing required fields: "row 0: ddr-densho-1000-511 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
2022-10-17 14:57:33,184 ERROR * Missing required fields: "row 1: ddr-densho-1000-511 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
2022-10-17 14:57:33,184 ERROR * Missing required fields: "row 2: ddr-densho-1000-511-1 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
...
2022-10-17 14:57:33,185 ERROR rows FAIL
2022-10-17 14:57:33,185 ERROR NOTE: Line numbers in errors may not be exact.
2022-10-17 14:57:33,185 ERROR Numbering starts at zero and may not include header row.
2022-10-17 14:57:33,185 INFO Checking repository
2022-10-17 14:57:33,186 INFO <git.repo.base.Repo '/var/www/media/ddr/ddr-densho-1000/.git'>
2022-10-17 14:57:33,186 DEBUG Popen(['git', 'diff', '--cached', '--name-only'], cwd=/var/www/media/ddr/ddr-densho-1000, universal_newlines=False, shell=None, istream=None)
2022-10-17 14:57:33,206 DEBUG Popen(['git', 'diff', '--name-only'], cwd=/var/www/media/ddr/ddr-densho-1000, universal_newlines=False, shell=None, istream=None)
2022-10-17 14:57:33,267 INFO ok
2022-10-17 14:57:33,267 ERROR TESTS FAILED--QUITTING!
Working hypothesis is that the code that looks at a CSV and guesses the model type is busted. There are CSV records for Files attached to both Entity and Segment parents:
id,external,basename-orig,role,sort,label,rights,public,mimetype,external_urls,links,tech_notes,digitize_person
ddr-densho-1000-511,1,ddr-densho-1000-511-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Transcript,cc,1,,,,,
ddr-densho-1000-511,1,ddr-densho-1000-511-glossary.htm,transcript,1,Mary Okazaki Kozu Interview Glossary,cc,1,,,,,
ddr-densho-1000-511-1,1,ddr-densho-1000-511-1-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Segment 1 Transcript,cc,1,,,,,
ddr-densho-1000-511-2,1,ddr-densho-1000-511-2-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Segment 2 Transcript,cc,1,,,,,
...
Maybe this is confusing the algorithm?
(cmdln) ddr@densho101dev:/opt/ddr-cmdln$ ddrimport file /tmp/ddr-densho-1000-511-transcripts.csv /var/www/media/ddr/ddr-densho-1000 --dryrun=1
2022-10-17 15:20:35,228 DEBUG <DDR.identifier.Identifier collection:ddr-densho-1000>
2022-10-17 15:20:35,228 INFO Checking CSV file
2022-10-17 15:20:35,228 INFO 27 rows
2022-10-17 15:20:35,229 DEBUG Guessing model based on 27 rows
models=['entity', 'segment']
errors=[]
model='entity'
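Just to make the hypothesis concrete: the id column alone already points at two different parent models, which matches the models=['entity', 'segment'] debug output above. This is only an illustration of the idea, not the actual DDR.identifier/DDR.batch code, and guess_parent_model is a made-up helper:

```python
# Illustration only -- not the real DDR code. Classify each row's parent by the
# depth of its ID: repo-org-cid-eid (4 parts) is an entity; add a segment
# number (5 parts) and it's a segment.
def guess_parent_model(object_id):
    depth = len(object_id.split('-'))
    return {4: 'entity', 5: 'segment'}.get(depth, 'unknown')

row_ids = [
    'ddr-densho-1000-511',      # entity-level transcript rows
    'ddr-densho-1000-511',
    'ddr-densho-1000-511-1',    # segment-level transcript rows
    'ddr-densho-1000-511-2',
]
print(sorted({guess_parent_model(i) for i in row_ids}))
# ['entity', 'segment'] -- two parent models in one file-import CSV
```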
Never mind, I tried editing the CSV to import just the Entity transcripts, or just the Segment transcripts, but it made no difference.
The third field in the header line is written as basename-orig when it should be basename_orig.
Also, DDR.batch.Checker._guess_model returns an error:
model_errs=['More than one model type in imput file!']
but this is not reported to the user.
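Whatever the final fix is, those model_errs need to make it to the console instead of being dropped. Something along these lines (names are illustrative, not the actual ddr-cmdln code):

```python
# Sketch only: log each guess error and bail, the same way the header/row
# checks above report their failures.
import logging
import sys

def report_model_guess(model, model_errs):
    if model_errs:
        for err in model_errs:
            logging.error('* %s', err)
        logging.error('model guess FAIL')
        sys.exit(1)
    logging.debug('model: %s', model)
```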
The CSV header has basename-orig instead of basename_orig. This confuses DDR.batch.Checker._guess_model, which looks for a basename_orig field to identify new File objects. New File objects don't yet have an ID, so the CSV id field contains the id of the parent object. If DDR.batch.Checker._guess_model sees an entity or segment identifier and also sees a basename_orig field in the headers, it identifies the CSV as containing new File objects.
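In other words, the logic is roughly this (a sketch of the shape of the decision, not the real DDR.batch.Checker._guess_model):

```python
# Rough sketch of the decision described above.
def guess_model(headers, row_ids):
    # New Files have no ID of their own, so the id column holds the parent
    # (entity or segment) ID: 4 hyphen-separated parts = entity, 5 = segment.
    parent_models = {
        {4: 'entity', 5: 'segment'}.get(len(i.split('-')), 'unknown')
        for i in row_ids
    }
    errors = []
    # entity/segment IDs plus a basename_orig header means "new File objects"
    if 'basename_orig' in headers and parent_models <= {'entity', 'segment'}:
        return 'file', errors
    if len(parent_models) > 1:
        errors.append('More than one model type in input file!')
    return sorted(parent_models)[0], errors

headers = ['id', 'external', 'basename-orig', 'role', 'sort', 'label', 'mimetype']
row_ids = ['ddr-densho-1000-511', 'ddr-densho-1000-511-1']
print(guess_model(headers, row_ids))
# ('entity', ['More than one model type in input file!'])
# Spell the header basename_orig and the same call returns ('file', []).
```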
The CSV contains both Entity and Segment transcript Files. This causes DDR.batch.Checker._guess_model to return a model_error complaining about too many model types. Unfortunately this error is not displayed to the user.
Correcting the basename-orig typo and separating the CSV into Entity files and Segment files should fix the problem.
Split the CSV into separate files for entity and segment. Haven't solved the basename_orig problem yet.
Refactored DDR.batch.Checker._guess_model to look for a list of fields that only appear in File objects, instead of only a single field that may be misspelled.
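The refactor amounts to something like this: check for any of a set of File-only headers rather than the single basename_orig column. The exact field list here is an assumption; the real set lives in the DDR model definitions:

```python
# Sketch of the refactored check, not the committed code. Any of these headers
# only appears on File objects, so one match is enough to call it a file CSV.
FILE_ONLY_FIELDS = {'basename_orig', 'mimetype', 'external_urls', 'tech_notes'}  # assumed set

def looks_like_file_csv(headers):
    return bool(FILE_ONLY_FIELDS & set(headers))

print(looks_like_file_csv(['id', 'external', 'basename-orig', 'role', 'mimetype']))  # True
print(looks_like_file_csv(['id', 'status', 'title', 'topics', 'facility']))          # False
```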
ddr-cmdln commit e1f7ad4 changed things to make model a required argument and to remove the guessing code.
@sarabeckman Running into a weird bug in ddr commandline. I'm trying to run ddrimport file on my local VM V5.4.6_HQMA. The files are transcript files. I'm getting a CSV test error from the importer saying I don't have the required fields; however, the fields listed are for ddrimport entity, not ddrimport file.
The file import CSV: ddr-densho-1000-511-transcripts.zip