denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

ddrimport files misidentifies CSV as entities #218

Closed gjost closed 1 year ago

gjost commented 1 year ago

@sarabeckman Running into a weird bug in the ddr command line. I'm trying to run ddrimport file on my local VM V5.4.6_HQMA. The files are transcript files. I'm getting a CSV test error from the importer saying I don't have the required fields; however, the fields listed are for ddrimport entity, not ddrimport file.

The file import CSV ddr-densho-1000-511-transcripts.zip

gjost commented 1 year ago

I was able to duplicate this on my dev VM:

(cmdln) ddr@densho101dev:/opt/ddr-cmdln$ ddrimport file /tmp/ddr-densho-1000-511-transcripts.csv /var/www/media/ddr/ddr-densho-1000 --dryrun=1
2022-10-17 14:57:31,314 DEBUG    <DDR.identifier.Identifier collection:ddr-densho-1000>
2022-10-17 14:57:31,314 INFO     Checking CSV file
2022-10-17 14:57:31,314 INFO     27 rows
2022-10-17 14:57:31,315 DEBUG    Guessing model based on 27 rows
2022-10-17 14:57:31,315 DEBUG    model: entity
2022-10-17 14:57:31,321 DEBUG    Starting new HTTP connection (1): partner.densho.org:80
2022-10-17 14:57:31,447 DEBUG    http://partner.densho.org:80 "GET /vocab/api/0.2/index.json HTTP/1.1" 301 None
2022-10-17 14:57:31,449 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2022-10-17 14:57:31,721 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/index.json HTTP/1.1" 200 None
...
2022-10-17 14:57:33,171 INFO     Validating headers
2022-10-17 14:57:33,171 ERROR    * Missing headers: "status"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "title"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "description"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "creation"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "location"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "creators"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "language"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "genre"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "format"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "extent"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "contributor"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "alternate_id"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "digitize_organization"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "digitize_date"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "credit"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "rights_statement"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "topics"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "persons"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "facility"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "chronology"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "geography"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "parent"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "signature_id"
2022-10-17 14:57:33,172 ERROR    * Missing headers: "notes"
2022-10-17 14:57:33,172 ERROR    * Bad headers: "external"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "basename-orig"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "role"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "label"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "mimetype"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "external_urls"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "links"
2022-10-17 14:57:33,173 ERROR    * Bad headers: "tech_notes"
2022-10-17 14:57:33,173 ERROR    headers FAIL
2022-10-17 14:57:33,173 INFO     Validating rows
2022-10-17 14:57:33,183 ERROR    * Duplicate IDs: "row 1: ddr-densho-1000-511"
2022-10-17 14:57:33,183 ERROR    * Missing required fields: "row 0: ddr-densho-1000-511 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
2022-10-17 14:57:33,184 ERROR    * Missing required fields: "row 1: ddr-densho-1000-511 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
2022-10-17 14:57:33,184 ERROR    * Missing required fields: "row 2: ddr-densho-1000-511-1 ['status', 'title', 'genre', 'format', 'extent', 'contributor', 'credit']"
...
2022-10-17 14:57:33,185 ERROR    rows FAIL
2022-10-17 14:57:33,185 ERROR    NOTE: Line numbers in errors may not be exact.
2022-10-17 14:57:33,185 ERROR          Numbering starts at zero and may not include header row.
2022-10-17 14:57:33,185 INFO     Checking repository
2022-10-17 14:57:33,186 INFO     <git.repo.base.Repo '/var/www/media/ddr/ddr-densho-1000/.git'>
2022-10-17 14:57:33,186 DEBUG    Popen(['git', 'diff', '--cached', '--name-only'], cwd=/var/www/media/ddr/ddr-densho-1000, universal_newlines=False, shell=None, istream=None)
2022-10-17 14:57:33,206 DEBUG    Popen(['git', 'diff', '--name-only'], cwd=/var/www/media/ddr/ddr-densho-1000, universal_newlines=False, shell=None, istream=None)
2022-10-17 14:57:33,267 INFO     ok
2022-10-17 14:57:33,267 ERROR    TESTS FAILED--QUITTING!

gjost commented 1 year ago

Working hypothesis: the code that looks at a CSV and guesses the model type is broken.

gjost commented 1 year ago

There are CSV records for Files attached to both Entity and Segment parents.

id,external,basename-orig,role,sort,label,rights,public,mimetype,external_urls,links,tech_notes,digitize_person
ddr-densho-1000-511,1,ddr-densho-1000-511-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Transcript,cc,1,,,,,
ddr-densho-1000-511,1,ddr-densho-1000-511-glossary.htm,transcript,1,Mary Okazaki Kozu Interview Glossary,cc,1,,,,,
ddr-densho-1000-511-1,1,ddr-densho-1000-511-1-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Segment 1 Transcript,cc,1,,,,,
ddr-densho-1000-511-2,1,ddr-densho-1000-511-2-transcript.htm,transcript,1,Mary Okazaki Kozu Interview Segment 2 Transcript,cc,1,,,,,
...

Maybe this is confusing the algorithm?

(cmdln) ddr@densho101dev:/opt/ddr-cmdln$ ddrimport file /tmp/ddr-densho-1000-511-transcripts.csv /var/www/media/ddr/ddr-densho-1000 --dryrun=1
2022-10-17 15:20:35,228 DEBUG    <DDR.identifier.Identifier collection:ddr-densho-1000>
2022-10-17 15:20:35,228 INFO     Checking CSV file
2022-10-17 15:20:35,228 INFO     27 rows
2022-10-17 15:20:35,229 DEBUG    Guessing model based on 27 rows
models=['entity', 'segment']
errors=[]
model='entity'

gjost commented 1 year ago

Never mind. I tried editing the CSV to import just the Entity transcripts, or just the Segment transcripts, but it made no difference.

gjost commented 1 year ago

The third field in the header line is written as basename-orig when it should be basename_orig.
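
A near-miss like this is easy to catch mechanically. The sketch below is only an illustration and is not code from ddr-cmdln: it uses the standard library's difflib to suggest the closest valid header. The FILE_FIELDS list is copied from the header line of the CSV in this issue (with the corrected spelling) and suggest_header is a hypothetical helper name.

```python
import difflib

# Assumption: field names copied from the CSV header in this issue,
# with basename-orig corrected to basename_orig.
FILE_FIELDS = [
    "id", "external", "basename_orig", "role", "sort", "label", "rights",
    "public", "mimetype", "external_urls", "links", "tech_notes",
    "digitize_person",
]

def suggest_header(header, valid_fields=FILE_FIELDS, cutoff=0.8):
    """Return the closest valid field name for header, or None if nothing is close."""
    matches = difflib.get_close_matches(header, valid_fields, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    for header in ["basename-orig", "role", "zzz"]:
        print(header, "->", suggest_header(header))
```

A "Bad headers" message could then include the suggestion, which would have pointed at the misspelling directly.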

gjost commented 1 year ago

Also, DDR.batch.Checker._guess_model returns an error:

model_errs=['More than one model type in imput file!']

but this is not reported to the user.
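
A minimal sketch of how such errors could be surfaced, assuming they arrive as a plain list of strings as shown above. The report_model_errors helper and the "ddrimport" logger name are hypothetical and not part of DDR.batch.

```python
import logging

def report_model_errors(model_errs, log=None):
    """Log every model-guessing error and return True only if there were none."""
    if log is None:
        # Assumption: logger name chosen for illustration only.
        log = logging.getLogger("ddrimport")
    for err in model_errs:
        log.error("* %s", err)
    return not model_errs

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format="%(levelname)-8s %(message)s")
    ok = report_model_errors(["More than one model type in imput file!"])
    print("model check passed:", ok)
```

Returning a pass/fail flag lets the caller fail the check at the point where the model is guessed, instead of falling through to header validation against the wrong model.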

gjost commented 1 year ago

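Diagnosis: a combination of two problems:

1. The basename_orig header is misspelled as basename-orig.
2. The CSV mixes Files attached to Entity parents with Files attached to Segment parents, so _guess_model finds more than one model type.

Correcting the basename-orig typo and separating the CSV into Entity files and Segment files should fix the problem.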

Solutions

gjost commented 1 year ago

Refactored DDR.batch.Checker._guess_model to look for a list of fields that only appear in File objects, instead of only a single field that may be misspelled.
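
The idea can be sketched roughly as follows. This is not the actual _guess_model code: the field set is an assumption, built from the headers that the entity check above rejected as "Bad headers" (with basename_orig spelled correctly), and looks_like_file_csv and min_matches are hypothetical names.

```python
# Assumption: fields that appear on File rows but not on Entity rows,
# taken from the "Bad headers" log lines in this issue.
FILE_ONLY_FIELDS = {
    "external", "basename_orig", "role", "label",
    "mimetype", "external_urls", "links", "tech_notes",
}

def looks_like_file_csv(headers, min_matches=2):
    """Return True if enough headers are fields that only File rows carry."""
    cleaned = {h.strip() for h in headers}
    return len(FILE_ONLY_FIELDS & cleaned) >= min_matches

if __name__ == "__main__":
    header_line = (
        "id,external,basename-orig,role,sort,label,rights,public,"
        "mimetype,external_urls,links,tech_notes,digitize_person"
    )
    print(looks_like_file_csv(header_line.split(",")))
```

Because several fields are checked, one misspelled header such as basename-orig no longer flips the guess to entity.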

gjost commented 1 year ago

ddr-cmdln commit e1f7ad4 makes model a required argument and removes the guessing code.
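
As a rough illustration of the "explicit model, no guessing" approach (not the code from that commit), a checker could validate the caller-supplied model up front. KNOWN_MODELS and require_model are hypothetical; the model names are the ones that appear in this thread.

```python
# Assumption: model names as they appear in this thread.
KNOWN_MODELS = ("entity", "segment", "file")

def require_model(model):
    """Return model unchanged if it is known; raise ValueError otherwise."""
    if model not in KNOWN_MODELS:
        raise ValueError(
            f"Unknown model {model!r}; expected one of {', '.join(KNOWN_MODELS)}"
        )
    return model

if __name__ == "__main__":
    print(require_model("file"))
```

With the model stated by the caller, a header typo produces a header error for the intended model rather than a run validated against the wrong one.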