ddrimport files misidentifies CSV as entities

gjost commented 1 year ago

@sarabeckman Running into a weird bug in the ddr command line. I'm trying to run ddrimport file on my local VM V5.4.6_HQMA. The files are transcript files. The importer reports a CSV test error saying I don't have the required fields; however, the fields listed are for ddrimport entity, not ddrimport file.

The file import CSV: ddr-densho-1000-511-transcripts.zip

gjost commented 1 year ago

I was able to duplicate this on my dev VM:

(cmdln) ddr@densho101dev:/opt/ddr-cmdln$ ddrimport file /tmp/ddr-densho-1000-511-transcripts.csv /var/www/media/ddr/ddr-densho-1000 --dryrun=1                                                
2022-10-17 14:57:31,314 DEBUG    <DDR.identifier.Identifier collection:ddr-densho-1000>                                                                                                       
2022-10-17 14:57:31,314 INFO     Checking CSV file                                                                                                                                            
2022-10-17 14:57:31,314 INFO     27 rows                                                                                                                                                      
2022-10-17 14:57:31,315 DEBUG    Guessing model based on 27 rows                                                                                                                              
2022-10-17 14:57:31,315 DEBUG    model: entity                                                                                                                                                
2022-10-17 14:57:31,321 DEBUG    Starting new HTTP connection (1): partner.densho.org:80                                                                                                      
2022-10-17 14:57:31,447 DEBUG    http://partner.densho.org:80 "GET /vocab/api/0.2/index.json HTTP/1.1" 301 None                                                                               
2022-10-17 14:57:31,449 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443                                                                                                    
2022-10-17 14:57:31,721 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/index.json HTTP/1.1" 200 None                                                                             
...
2022-10-17 14:57:33,171 INFO     Validating headers                                            
2022-10-17 14:57:33,171 ERROR    * Missing headers: "status"                                   
2022-10-17 14:57:33,172 ERROR    * Missing headers: "title"                                    
2022-10-17 14:57:33,172 ERROR    * Missing headers: "description"                              
2022-10-17 14:57:33,172 ERROR    * Missing headers: "creation"                                 
2022-10-17 14:57:33,172 ERROR    * Missing headers: "location"                                 
2022-10-17 14:57:33,172 ERROR    * Missing headers: "creators"                                 
2022-10-17 14:57:33,172 ERROR    * Missing headers: "language"                                 
2022-10-17 14:57:33,172 ERROR    * Missing headers: "genre"                                    
2022-10-17 14:57:33,172 ERROR    * Missing headers: "format"                                   
2022-10-17 14:57:33,172 ERROR    * Missing headers: "extent"                                   
2022-10-17 14:57:33,172 ERROR    * Missing headers: "contributor"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "alternate_id"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "digitize_organization"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "digitize_date"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "credit"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "rights_statement"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "topics"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "persons"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "facility"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "chronology"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "geography"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "parent"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "signature_id"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "notes"
2022-10-17 14:57:33,172 ERROR    * Bad headers: "external"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "basename-orig"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "role"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "label"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "mimetype"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "external_urls"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "links"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "tech_notes"
2022-10-17 14:57:33,173 ERROR    headers FAIL
2022-10-17 14:57:33,173 INFO     Validating rows
2022-10-17 14:57:33,183 ERROR    * Duplicate IDs: "row 1: ddr-densho-1000-511"
2022-10-17 14:57:33,183 ERROR    * Missing required fields: "row 0: ddr-densho-1000-511 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
2022-10-17 14:57:33,184 ERROR    * Missing required fields: "row 1: ddr-densho-1000-511 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
2022-10-17 14:57:33,184 ERROR    * Missing required fields: "row 2: ddr-densho-1000-511-1 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
...
2022-10-17 14:57:33,185 ERROR    rows FAIL
2022-10-17 14:57:33,185 ERROR    NOTE: Line numbers in errors may not be exact.
2022-10-17 14:57:33,185 ERROR          Numbering starts at zero and may not include header row.
2022-10-17 14:57:33,185 INFO     Checking repository
2022-10-17 14:57:33,186 INFO     <git.repo.base.Repo '/var/www/media/ddr/ddr-densho-1000/.git'>
2022-10-17 14:57:33,186 DEBUG    Popen(['git', 'diff', '--cached', '--name-only'], cwd=/var/www/media/ddr/ddr-densho-1000, universal_newlines=False, shell=None, istream=None)
2022-10-17 14:57:33,206 DEBUG    Popen(['git', 'diff', '--name-only'], cwd=/var/www/media/ddr/ddr-densho-1000, universal_newlines=False, shell=None, istream=None)
2022-10-17 14:57:33,267 INFO     ok
2022-10-17 14:57:33,267 ERROR    TESTS FAILED--QUITTING!

gjost commented 1 year ago

Working hypothesis is that the code that looks at a CSV and guesses the type is busted.

gjost commented 1 year ago

There are CSV records for Files attached to both Entity and Segment parents.

id,external,basename-orig,role,sort,label,rights,public,mimetype,external_urls,links,tech_notes,digitize_person
ddr-densho-1000-511,1,ddr-densho-1000-511-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Transcript,cc,1,,,,,
ddr-densho-1000-511,1,ddr-densho-1000-511-glossary.htm,transcript,1,Mary Okazaki Kozu Interview Glossary,cc,1,,,,,
ddr-densho-1000-511-1,1,ddr-densho-1000-511-1-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Segment 1 Transcript,cc,1,,,,,
ddr-densho-1000-511-2,1,ddr-densho-1000-511-2-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Segment 2 Transcript,cc,1,,,,,
...

Maybe this is confusing the algorithm?

(cmdln) ddr@densho101dev:/opt/ddr-cmdln$ ddrimport file /tmp/ddr-densho-1000-511-transcripts.csv /var/www/media/ddr/ddr-densho-1000 --dryrun=1
2022-10-17 15:20:35,228 DEBUG    <DDR.identifier.Identifier collection:ddr-densho-1000>
2022-10-17 15:20:35,228 INFO     Checking CSV file
2022-10-17 15:20:35,228 INFO     27 rows
2022-10-17 15:20:35,229 DEBUG    Guessing model based on 27 rows
models=['entity', 'segment']
errors=[]
model='entity'

gjost commented 1 year ago

Never mind. I tried editing the CSV to import just the Entity transcripts, or just the Segment transcripts, but it made no difference.

gjost commented 1 year ago

The third field in the header line is written as basename-orig when it should be basename_orig
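A quick way to catch this kind of header typo before running the importer is to flag any header containing a hyphen. This is only a sketch: it assumes DDR field names always use underscores (true of every field name in the log above), and suspicious_headers is a made-up helper, not part of ddr-cmdln.

```python
import csv
import io

def suspicious_headers(csv_text):
    """Map each hyphenated header to its underscore spelling.

    Assumption: valid field names never contain hyphens.
    """
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)  # first row is the header line
    return {h: h.replace("-", "_") for h in headers if "-" in h}

sample = "id,external,basename-orig,role,sort,label\n"
print(suspicious_headers(sample))
```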

gjost commented 1 year ago

Also, DDR.batch.Checker._guess_model returns an error:

model_errs=['More than one model type in imput file!']

but this is not reported to the user.
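Surfacing those errors could be as simple as logging each one and treating a non-empty list as a failed check. A rough sketch, assuming the errors arrive as a plain list of strings like model_errs above; report_model_errors and the logger name are hypothetical, not the actual DDR.batch code.

```python
import logging

logger = logging.getLogger("ddrimport-sketch")  # hypothetical logger name

def report_model_errors(model_errs):
    """Log every model-guessing error; return True only if there were none."""
    for err in model_errs:
        logger.error("* Model error: %s", err)
    return not model_errs

ok = report_model_errors(["More than one model type in imput file!"])
```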

gjost commented 1 year ago

Diagnosis: a combination of two problems:

1. The CSV header has basename-orig instead of basename_orig. This confuses DDR.batch.Checker._guess_model, which looks for a basename_orig field to identify new File objects. New File objects don't yet have an ID, so the CSV id field contains the ID of the parent object. If DDR.batch.Checker._guess_model sees an entity or segment identifier and also sees a basename_orig field in the headers, it identifies the CSV as containing new File objects.
2. The CSV contains both Entity and Segment transcript Files. This causes DDR.batch.Checker._guess_model to return a model error complaining about too many model types. Unfortunately this error is not displayed to the user.

Correcting the basename-orig typo and separating the CSV into Entity files and Segment files should fix the problem.
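Separating the CSV can be scripted. The sketch below is an assumption-laden example, not a DDR tool: it guesses the parent model from the number of hyphen-separated parts in the id column (four for an entity such as ddr-densho-1000-511, five for a segment such as ddr-densho-1000-511-1), which matches the sample rows above but may not hold for every collection. split_by_parent_model and to_csv are hypothetical names. It does not rename headers.

```python
import csv
import io

def split_by_parent_model(csv_text):
    """Group file-import rows by the (assumed) model of their parent ID."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames
    groups = {"entity": [], "segment": []}
    for row in reader:
        parts = row["id"].split("-")
        if len(parts) == 4:      # assumption: entity IDs have four parts
            groups["entity"].append(row)
        elif len(parts) == 5:    # assumption: segment IDs have five parts
            groups["segment"].append(row)
        else:
            raise ValueError(f"Unexpected id: {row['id']}")
    return headers, groups

def to_csv(headers, rows):
    """Serialize rows back to CSV text with the original header order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

sample = (
    "id,basename_orig,role\n"
    "ddr-densho-1000-511,ddr-densho-1000-511-transcript.htm,transcript\n"
    "ddr-densho-1000-511,ddr-densho-1000-511-glossary.htm,transcript\n"
    "ddr-densho-1000-511-1,ddr-densho-1000-511-1-transcript.htm,transcript\n"
)
headers, groups = split_by_parent_model(sample)
entity_csv = to_csv(headers, groups["entity"])
segment_csv = to_csv(headers, groups["segment"])
```

Each resulting CSV then holds Files for only one parent model and can be imported on its own.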

Solutions

Either complain when there are multiple model types in the file or make it Just Work(TM) if there are entity and segment.
I don't know what to do about the basename_orig problem yet.

gjost commented 1 year ago

Refactored DDR.batch.Checker._guess_model to look for a list of fields that only appear in File objects, instead of only a single field that may be misspelled.
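Roughly, the idea looks like the sketch below. It is not the actual refactored code: the field list is an assumption taken from the "Bad headers" lines in the log above (with basename_orig spelled correctly), and looks_like_file_csv and min_matches are made-up names for illustration.

```python
# Assumed file-only fields, inferred from the "Bad headers" log lines;
# not copied from the DDR source.
FILE_ONLY_FIELDS = {
    "external", "basename_orig", "role", "label",
    "mimetype", "external_urls", "links", "tech_notes",
}

def looks_like_file_csv(headers, min_matches=2):
    """True if enough file-only fields appear in the header row.

    Requiring several matches means one misspelled header
    (e.g. basename-orig) no longer changes the outcome.
    """
    return len(FILE_ONLY_FIELDS & set(headers)) >= min_matches

# Header line from the problem CSV, typo included.
bad_headers = (
    "id,external,basename-orig,role,sort,label,rights,public,"
    "mimetype,external_urls,links,tech_notes,digitize_person"
).split(",")
print(looks_like_file_csv(bad_headers))
```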

gjost commented 1 year ago

ddr-cmdln commit e1f7ad4 changed things to make model a required argument and to remove the guessing code.

denshoproject / ddr-cmdln

ddrimport files misidentifies CSV as entities #218
