dolthub / dolt

Dolt – Git for Data
Apache License 2.0
17.78k stars 506 forks source link

Dolt export/import warts #8386

Closed max-hoffman closed 1 day ago

max-hoffman commented 1 day ago

I had to do some data processing to export/import a portion of a table from the mediawiki database (https://www.dolthub.com/repositories/timsehn/media_wiki/query/main?active=Tables&q=show+create+table+categorylinks%3B).

There were three problems:

  1. The name of the exported file has a double .csv extension by default (.csv.csv)
  2. There are line breaks in the CSV, which i can't tell are because of \n characters or some other decoding problem.
  3. Importing into a field with a default value for nulls errors for null values.

Original export:

cl_from,cl_to,cl_sortkey,cl_sortkey_prefix,cl_timestamp,cl_collation,cl_type
13303,Communes_of_Tarn-et-Garonne,MONTAUBAN,,2024-03-12 02:10:49,uppercase,page
13303,Pages_with_broken_file_links,MONTAUBAN,,2024-03-12 02:10:49,uppercase,page
13303,Prefectures_in_France,MONTAUBAN,,2024-03-12 02:10:49,uppercase,page
13303,Quercy,MONTAUBAN,,2024-03-12 02:10:49,uppercase,page
13304,Miniature_railways,
RAIL TRANSPORT MODELLING, ,2024-03-12 02:10:50,uppercase,page
13304,Pages_with_broken_file_links,RAIL TRANSPORT MODELLING,,2024-03-12 02:10:50,uppercase,page
13304,Rail_transport_modelling, RAIL TRANSPORT MODELLING, ,2024-03-12 02:10:50,uppercase,page
13305,Linguistic_morphology,MORPHOPHONOLOGY,,2024-03-12 02:10:51,uppercase,page
13305,Morphophonology, MORPHOPHONOLOGY, ,2024-03-12 02:10:51,uppercase,page
13305,Orthography,MORPHOPHONOLOGY,,2024-03-12 02:10:51,uppercase,page
13305,Phonology,MORPHOPHONOLOGY,,2024-03-12 02:10:51,uppercase,page
13306,Glass_applications,MIRROR,,2024-03-12 02:10:52,uppercase,page

Error for null field:

An error occurred while moving data
cause: column name 'cl_sortkey_prefix' is non-nullable but attempted to set a value of null
       A bad row was encountered inserting into table categorylinks (on line 9):
        cl_from: 13305
        cl_to: Linguistic_morphology
        cl_sortkey: MORPHOPHONOLOGY
        cl_sortkey_prefix: <nil>
        cl_timestamp: 2024-03-12 02:10:51
        cl_collation: uppercase
        cl_type: page

error for bad line break:

An error occurred while moving data
cause: CSV reader expected 7 values, but saw 3.
       row values: '{
        "cl_from": "13304"
        "cl_to": "Miniature_railways"
        "cl_sortkey": ""
        "cl_sortkey_prefix": ""
        "cl_timestamp": ""
        "cl_collation": ""
        "cl_type": ""
       }
       '
       A bad row was encountered inserting into table categorylinks (on line 6):
        cl_from: 13304
        cl_to: Miniature_railways
        cl_sortkey: <nil>

functioning script with CSV data edits:

#!/bin/bash

rm -rf tmp1
mkdir tmp1
cd tmp1
dolt init

dolt sql <<EOF
CREATE TABLE categorylinks (
  cl_from int unsigned NOT NULL DEFAULT '0',
  cl_to varbinary(255) NOT NULL DEFAULT '',
  cl_sortkey varbinary(230) NOT NULL DEFAULT '',
  cl_sortkey_prefix varbinary(255) NOT NULL DEFAULT '',
  cl_timestamp timestamp NOT NULL,
  cl_collation varbinary(32) NOT NULL DEFAULT '',
  cl_type enum('page','subcat','file') NOT NULL DEFAULT '1',
  PRIMARY KEY (cl_from,cl_to),
  KEY cl_collation_ext (cl_collation,cl_to,cl_type,cl_from),
  KEY cl_sortkey (cl_to,cl_type,cl_sortkey,cl_from),
  KEY cl_timestamp (cl_to,cl_timestamp)
);
EOF

cat <<EOF > test1.csv
cl_from,cl_to,cl_sortkey,cl_sortkey_prefix,cl_timestamp,cl_collation,cl_type
13303,Communes_of_Tarn-et-Garonne,MONTAUBAN,'',2024-03-12 02:10:49,uppercase,page
13303,Pages_with_broken_file_links,MONTAUBAN,'',2024-03-12 02:10:49,uppercase,page
13303,Prefectures_in_France,MONTAUBAN,'',2024-03-12 02:10:49,uppercase,page
13303,Quercy,MONTAUBAN,'',2024-03-12 02:10:49,uppercase,page
13304,Miniature_railways, RAIL TRANSPORT MODELLING,'',2024-03-12 02:10:50,uppercase,page
13304,Pages_with_broken_file_links,RAIL TRANSPORT MODELLING,'',2024-03-12 02:10:50,uppercase,page
13304,Rail_transport_modelling, RAIL TRANSPORT MODELLING,'',2024-03-12 02:10:50,uppercase,page
13305,Linguistic_morphology,MORPHOPHONOLOGY,'',2024-03-12 02:10:51,uppercase,page
13305,Morphophonology, MORPHOPHONOLOGY,'',2024-03-12 02:10:51,uppercase,page
13305,Orthography,MORPHOPHONOLOGY,'',2024-03-12 02:10:51,uppercase,page
13305,Phonology,MORPHOPHONOLOGY,'',2024-03-12 02:10:51,uppercase,page
13306,Glass_applications,MIRROR,'',2024-03-12 02:10:52,uppercase,page
EOF

dolt table import -u categorylinks test1.csv
timsehn commented 1 day ago

Maybe break these into three specific bugs?

max-hoffman commented 1 day ago

split off 4 issues https://github.com/dolthub/dolt/issues/8387 https://github.com/dolthub/dolt/issues/8388 https://github.com/dolthub/dolt/issues/8389 https://github.com/dolthub/ld/issues/18508