datagouv / csv-detective

CSV inspection

Performance issue with csv-detective #68

Closed by maudetes 9 months ago

maudetes commented 10 months ago

When running the csv-detective routine (with num_rows=-1) on the datasets catalog (~100 MB), the total run time is ~160 seconds.

Most of this time (~96%) is spent in the "Testing columns" step.

Verbose logs in detail

```
INFO:root:Detecting encoding
INFO:root:Detected encoding: "UTF-8" in 0.213s (confidence: 99%)
INFO:root:Detecting separator
INFO:root:Detected separator: ";" in 0.0s
INFO:root:Detecting headers
INFO:root:Detected headers in 0.0s
INFO:root:Detecting heading columns
INFO:root:No heading column detected in 0.0s
INFO:root:Detecting trailing columns
INFO:root:No trailing column detected in 0.0s
INFO:root:Parsing table
WARNING:root:Table parsed successfully in 2.613s
INFO:root:Detecting categorical columns
INFO:root:Detected 6 categorical columns out of 30 in 0.658s
INFO:root:Testing columns to get types
CRITICAL:root: - Done with type "date" in 21.878s (1/47)
INFO:root: - Done with type "year" in 0.305s (2/47)
INFO:root: - Done with type "email" in 0.389s (3/47)
INFO:root: - Done with type "mongo_object_id" in 0.418s (4/47)
INFO:root: - Done with type "uuid" in 0.41s (5/47)
INFO:root: - Done with type "url" in 0.335s (6/47)
INFO:root: - Done with type "iso_country_code_alpha2" in 0.308s (7/47)
INFO:root: - Done with type "iso_country_code_alpha3" in 0.35s (8/47)
INFO:root: - Done with type "iso_country_code_numeric" in 0.324s (9/47)
INFO:root: - Done with type "jour_de_la_semaine" in 0.353s (10/47)
INFO:root: - Done with type "csp_insee" in 0.33s (11/47)
INFO:root: - Done with type "tel_fr" in 0.357s (12/47)
INFO:root: - Done with type "siren" in 0.348s (13/47)
INFO:root: - Done with type "code_csp_insee" in 0.313s (14/47)
INFO:root: - Done with type "sexe" in 0.286s (15/47)
CRITICAL:root: - Done with type "pays" in 17.903s (16/47)
INFO:root: - Done with type "code_departement" in 0.407s (17/47)
CRITICAL:root: - Done with type "adresse" in 18.212s (18/47)
INFO:root: - Done with type "code_commune_insee" in 0.363s (19/47)
CRITICAL:root: - Done with type "commune" in 20.625s (20/47)
INFO:root: - Done with type "region" in 0.647s (21/47)
INFO:root: - Done with type "code_postal" in 0.587s (22/47)
CRITICAL:root: - Done with type "departement" in 22.128s (23/47)
INFO:root: - Done with type "uai" in 0.495s (24/47)
INFO:root: - Done with type "siret" in 0.569s (25/47)
CRITICAL:root: - Done with type "latitude_wgs" in 3.878s (26/47)
CRITICAL:root: - Done with type "longitude_wgs" in 5.02s (27/47)
INFO:root: - Done with type "latlon_wgs" in 0.406s (28/47)
INFO:root: - Done with type "json_geojson" in 0.579s (29/47)
INFO:root: - Done with type "code_fantoir" in 0.438s (30/47)
INFO:root: - Done with type "insee_ape700" in 0.388s (31/47)
INFO:root: - Done with type "datetime_iso" in 0.451s (32/47)
INFO:root: - Done with type "datetime_rfc822" in 0.402s (33/47)
CRITICAL:root: - Done with type "latitude_wgs_fr_metropole" in 3.489s (34/47)
CRITICAL:root: - Done with type "longitude_wgs_fr_metropole" in 3.126s (35/47)
INFO:root: - Done with type "code_region" in 0.347s (36/47)
INFO:root: - Done with type "booleen" in 0.404s (37/47)
INFO:root: - Done with type "twitter" in 0.357s (38/47)
WARNING:root: - Done with type "float" in 1.248s (39/47)
WARNING:root: - Done with type "int" in 1.056s (40/47)
INFO:root: - Done with type "json" in 0.433s (41/47)
CRITICAL:root: - Done with type "latitude_l93" in 3.56s (42/47)
CRITICAL:root: - Done with type "longitude_l93" in 3.231s (43/47)
CRITICAL:root: - Done with type "insee_canton" in 19.299s (44/47)
INFO:root: - Done with type "date_fr" in 0.347s (45/47)
INFO:root: - Done with type "code_waldec" in 0.494s (46/47)
INFO:root: - Done with type "code_rna" in 0.44s (47/47)
CRITICAL:root:Done testing columns in 158.045s
INFO:root:Testing labels to get types
INFO:root: - Done with type "adresse" in 0.002s (1/48)
INFO:root: - Done with type "code_commune_insee" in 0.002s (2/48)
INFO:root: - Done with type "code_departement" in 0.002s (3/48)
INFO:root: - Done with type "code_fantoir" in 0.002s (4/48)
INFO:root: - Done with type "code_postal" in 0.003s (5/48)
INFO:root: - Done with type "code_region" in 0.002s (6/48)
INFO:root: - Done with type "commune" in 0.002s (7/48)
INFO:root: - Done with type "departement" in 0.003s (8/48)
INFO:root: - Done with type "insee_canton" in 0.003s (9/48)
INFO:root: - Done with type "latitude_l93" in 0.003s (10/48)
INFO:root: - Done with type "latitude_wgs_fr_metropole" in 0.003s (11/48)
INFO:root: - Done with type "longitude_l93" in 0.003s (12/48)
INFO:root: - Done with type "longitude_wgs_fr_metropole" in 0.002s (13/48)
INFO:root: - Done with type "pays" in 0.003s (14/48)
INFO:root: - Done with type "region" in 0.002s (15/48)
INFO:root: - Done with type "code_csp_insee" in 0.002s (16/48)
INFO:root: - Done with type "code_rna" in 0.002s (17/48)
INFO:root: - Done with type "code_waldec" in 0.002s (18/48)
INFO:root: - Done with type "csp_insee" in 0.002s (19/48)
INFO:root: - Done with type "date_fr" in 0.002s (20/48)
INFO:root: - Done with type "insee_ape700" in 0.002s (21/48)
INFO:root: - Done with type "sexe" in 0.002s (22/48)
INFO:root: - Done with type "siren" in 0.004s (23/48)
INFO:root: - Done with type "siret" in 0.004s (24/48)
INFO:root: - Done with type "tel_fr" in 0.003s (25/48)
INFO:root: - Done with type "uai" in 0.002s (26/48)
INFO:root: - Done with type "jour_de_la_semaine" in 0.002s (27/48)
INFO:root: - Done with type "mois_de_annee" in 0.002s (28/48)
INFO:root: - Done with type "iso_country_code_alpha2" in 0.003s (29/48)
INFO:root: - Done with type "iso_country_code_alpha3" in 0.002s (30/48)
INFO:root: - Done with type "iso_country_code_numeric" in 0.002s (31/48)
INFO:root: - Done with type "json_geojson" in 0.002s (32/48)
INFO:root: - Done with type "latitude_wgs" in 0.003s (33/48)
INFO:root: - Done with type "latlon_wgs" in 0.004s (34/48)
INFO:root: - Done with type "longitude_wgs" in 0.003s (35/48)
INFO:root: - Done with type "booleen" in 0.002s (36/48)
INFO:root: - Done with type "email" in 0.003s (37/48)
INFO:root: - Done with type "mongo_object_id" in 0.003s (38/48)
INFO:root: - Done with type "uuid" in 0.002s (39/48)
INFO:root: - Done with type "float" in 0.002s (40/48)
INFO:root: - Done with type "int" in 0.002s (41/48)
INFO:root: - Done with type "money" in 0.002s (42/48)
INFO:root: - Done with type "twitter" in 0.002s (43/48)
INFO:root: - Done with type "url" in 0.003s (44/48)
INFO:root: - Done with type "date" in 0.003s (45/48)
INFO:root: - Done with type "datetime_iso" in 0.003s (46/48)
INFO:root: - Done with type "datetime_rfc822" in 0.003s (47/48)
INFO:root: - Done with type "year" in 0.002s (48/48)
INFO:root:Done testing labels in 0.133s
INFO:root:Creating profile
WARNING:root:Created profile in 2.445s
CRITICAL:root:Routine completed in 164.138s
```
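
For anyone who wants to re-check the ~96% figure or pick out the slowest type tests from logs like these, here is a minimal sketch. It is not part of csv-detective: `summarize_log`, `SAMPLE_LOG`, and the 10-second threshold are hypothetical names and values chosen for illustration, and the regexes only assume the message wording shown in the logs above.

```python
import re

# Patterns based only on the log wording shown in this issue.
TYPE_RE = re.compile(r'Done with type "([^"]+)" in\s+([\d.]+)s')
COLUMNS_RE = re.compile(r"Done testing columns in\s+([\d.]+)s")
ROUTINE_RE = re.compile(r"Routine completed in\s+([\d.]+)s")


def summarize_log(log_text, threshold=10.0):
    """Return (slow_types, share), where slow_types maps each column type
    test taking at least `threshold` seconds to its duration, and share is
    the fraction of the whole routine spent testing columns."""
    columns_total = float(COLUMNS_RE.search(log_text).group(1))
    routine_total = float(ROUTINE_RE.search(log_text).group(1))

    # Keep only the column-testing phase, so that the (fast) label tests
    # that reuse the same type names are ignored.
    _, _, after = log_text.partition("Testing columns to get types")
    section, _, _ = after.partition("Done testing columns")

    slow_types = {
        name: float(secs)
        for name, secs in TYPE_RE.findall(section)
        if float(secs) >= threshold
    }
    return slow_types, columns_total / routine_total


# A short excerpt of the logs above, used as sample input.
SAMPLE_LOG = """
INFO:root:Testing columns to get types
CRITICAL:root: - Done with type "date" in 21.878s (1/47)
INFO:root: - Done with type "year" in 0.305s (2/47)
CRITICAL:root: - Done with type "pays" in 17.903s (16/47)
CRITICAL:root:Done testing columns in 158.045s
INFO:root:Testing labels to get types
INFO:root: - Done with type "date" in 0.003s (45/48)
CRITICAL:root:Routine completed in 164.138s
"""

slow_types, share = summarize_log(SAMPLE_LOG)
print(slow_types)
print(f"{share:.1%} of the routine spent testing columns")
```

Run on the full logs, the same function would list the six types above the threshold (date, pays, adresse, commune, departement, insee_canton), which together account for about 120 of the 158 seconds.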

This ends up timing out in hydra workers, which makes CSV parsing fail: https://errors.data.gouv.fr/organizations/sentry/issues/129487/events/fced5f3fae964450b7d249efa9a35f96/?project=2&referrer=issue-list&statsPeriod=14d

Pierlou commented 10 months ago

This patch: https://github.com/etalab/csv-detective/pull/69 improves performance. For the same file, the analysis is now down to 30 seconds. Hopefully it'll allow us to keep the timeout low 🙏

maudetes commented 9 months ago

Seems to have worked for our needs! :clap: