etalab / transport-validator

GTFS validator
https://transport.data.gouv.fr/validation/
MIT License

Limit the number of related objects #101

Closed antoine-de closed 3 years ago

antoine-de commented 3 years ago

I think we should never link to several Trips in the related objects, as this can make the number of related objects explode.

For instance, instead of listing every trip on which 2 stops are too close, maybe we can just add the Route as the related object (plus the number of trips? and/or an example of a trip on this route that causes the problem?).
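To make the proposal concrete, here is a minimal sketch of collapsing the offending trips into one entry per route, keeping a trip count and one example trip. The `TripRef` and `RouteSummary` types and the `summarise_by_route` function are hypothetical names made up for illustration; they are not the validator's actual types or API.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified reference to a trip (not the validator's real type).
#[derive(Debug, Clone, PartialEq)]
struct TripRef {
    trip_id: String,
    route_id: String,
}

/// Hypothetical summary emitted instead of one related object per trip.
#[derive(Debug, Clone, PartialEq)]
struct RouteSummary {
    route_id: String,
    /// Number of trips on this route that trigger the issue.
    trip_count: usize,
    /// First offending trip seen for this route, kept as an example.
    example_trip_id: String,
}

/// Collapse the offending trips into one summary per route.
/// The result is sorted by route id because a BTreeMap is used.
fn summarise_by_route(trips: &[TripRef]) -> Vec<RouteSummary> {
    let mut by_route: BTreeMap<&str, RouteSummary> = BTreeMap::new();
    for trip in trips {
        by_route
            .entry(trip.route_id.as_str())
            .and_modify(|summary| summary.trip_count += 1)
            .or_insert_with(|| RouteSummary {
                route_id: trip.route_id.clone(),
                trip_count: 1,
                example_trip_id: trip.trip_id.clone(),
            });
    }
    by_route.into_values().collect()
}
```

With this shape, the number of related objects grows with the number of routes rather than with the number of trips, which is the point of the proposal.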

Linked to https://github.com/etalab/transport-site/issues/1360

thbar commented 3 years ago

Preliminary notes, in brain dump mode for now:

I first looked at the transport-site database, which contains the validation history, using:

-- Size of each stored validation payload, and its share of the total
SELECT
    validation_json_size,
    to_char(100 * validation_json_size::float / (SUM(validation_json_size) OVER ()), '999D99%') AS ratio,
    subquery.id,
    resource_id,
    url
-- pg_column_size returns the number of bytes used to store the value
FROM (SELECT *, pg_column_size(details) AS validation_json_size FROM validations) AS subquery
INNER JOIN resource r ON r.id = resource_id
WHERE validation_json_size IS NOT NULL
ORDER BY validation_json_size DESC

On a recent database, this gives (before optimisation; top rows only, sizes in bytes as reported by pg_column_size):

validation_json_size ratio id resource_id url
1489511 16.68% 102514 10391 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/17_07_20_mobibreizhbret_gtfs_zip
1103264 12.35% 110330 12714 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/02_10_2020_mobibreizhbret_gtfs_zip
1066277 11.94% 104949 10987 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/10_08_2020_mobibreizhbret_gtfs_zip
1039256 11.64% 101524 10248 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/mobibreizhbret_201706_gtfs_zip
959260 10.74% 114245 16188 https://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/11_2020_mobibreizhbret_gtfs_zip
953154 10.67% 113066 16189 http://breizh.opendatasoft.com/api/datasets/1.0/base-de-donnees-multimodale-transports-publics-en-bretagne-mobibreizh/attachments/11_2020_mobibreizhbret_gtfs_zip
274778 3.08% 113782 8361 https://data.iledefrance-mobilites.fr/api/v2/catalog/datasets/offre-horaires-tc-gtfs-idf/files/736ca2f956a1b6cc102649ed6fd56d45
191509 2.14% 113623 8781 https://data.centrevaldeloire.fr/api/v2/catalog/datasets/jvmalin-point-dacces-national/files/a98f4cdb41591e3530c1e4f29d39fc53
155512 1.74% 100930 7579 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq-aggregated-gtfs.zip
59756 .67% 113607 9178 https://data.centrevaldeloire.fr/api/v2/catalog/datasets/offre-theorique-mobilite-remi/files/8fdd2d65720750e8064cbfee68426e0f
58506 .66% 104631 10341 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20200706_zip
49832 .56% 114319 16387 https://ressources.data.sncf.com/api/v2/catalog/datasets/sncf-ter-gtfs/files/24e02fa969496e2caa5863a365c66ec2
39581 .44% 109853 11939 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20200924_zip
39037 .44% 114445 7840 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq_lim-aggregated-gtfs.zip
37718 .42% 114428 7806 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq_gir-aggregated-gtfs.zip
36156 .40% 100608 9093 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20200106_zip
35406 .40% 100444 8315 https://static.data.gouv.fr/resources/gtfs-de-la-societe-de-transport-urbain-du-grand-montauban-semtm/20181128-174626/gtfs.zip
35307 .40% 101777 8336 http://data.haute-garonne.fr/api/datasets/1.0/lignes-regulieres-format-gtfs/attachments/reseau_lr_gtfs_20191104_zip
35098 .39% 100586 8322 https://data.mulhouse-alsace.fr/api/datasets/1.0/offre-de-transport-solea-et-tram-train-en-format-gtfs/alternative_exports/sitram_gtfs_2018_2019_zip
30096 .34% 114413 7654 https://opendata.lillemetropole.fr/api/datasets/1.0/transport_arret_transpole-point/alternative_exports/gtfs_zip
27776 .31% 95156 7581 https://www.pigma.org/public/opendata/nouvelle_aquitaine_mobilites/publication/naq-aggregated-netex.zip
27229 .30% 101713 10335 https://static.data.gouv.fr/resources/horaires-theoriques-du-reseau-zoom-le-grand-chalon-gtfs/20200603-143958/gtfs-20200603-01-3-.zip
23419 .26% 107285 10635 https://sig.hautsdefrance.fr/ext/opendata/Transport/GTFS/59/RHDF_GTFS_COM_SCO_59_P1.zip
22496 .25% 113211 16402 https://static.data.gouv.fr/resources/horaires-theoriques-du-reseau-zoom-le-grand-chalon-gtfs-1/20201105-155731/gtfs-20201105-03.zip
22063 .25% 107299 10636 https://sig.hautsdefrance.fr/ext/opendata/Transport/GTFS/59/RHDF_GTFS_COM_SCO_59_P2.zip
19840 .22% 114312 16499 https://ressources.data.sncf.com/api/v2/catalog/datasets/sncf-intercites-gtfs/files/ed829c967a0da1252f02baaf684db32c
19316 .22% 112295 8593 https://trouver.datasud.fr/dataset/44187c20-e037-4733-950a-b4463d314b90/resource/f6342a2c-d02a-405f-9700-6a7121e2e06f/download/gtfs_84.zip
18305 .20% 100654 7906 https://static.data.gouv.fr/resources/offre-de-transports-reseau-dk-bus-de-la-communaute-urbaine-de-dunkerque-gtfs/20190701-034402/gtfs.zip
16947 .19% 114308 8588 https://trouver.datasud.fr/dataset/44187c20-e037-4733-950a-b4463d314b90/resource/db4be056-c7e8-4efb-8299-4b8c6235defe/download/gtfs_06.zip
16711 .19% 113547 11757 https://exs.grandest2.cityway.fr/GTFS.aspx?Key=OPENDATA&OperatorCode=CG68
16617 .19% 113933 10354 https://data.explore.divia.fr/api/datasets/1.0/gtfs-divia-mobilites/attachments/gtfs_diviamobilites_current_zip
16160 .18% 107946 10363 https://exs.grandest2.cityway.fr/GTFS.aspx?Key=OPENDATA&OperatorCode=CG68
15858 .18% 100423 9779 https://static.data.gouv.fr/resources/reseau-taneo-1/20200529-052803/gtfs-lot1-20200417-20201231.zip
15736 .18% 100542 8045 https://static.data.gouv.fr/resources/offre-de-transport-du-reseau-trema-gtfs/20190827-090229/export-2-septembre.zip
15685 .18% 101820 10344 https://static.data.gouv.fr/resources/horaires-du-reseau-ntecc-periode-scolaire-1/20200708-084224/gtfs.zip
15414 .17% 113574 11763 https://exs.grandest2.cityway.fr/GTFS.aspx?Key=OPENDATA&OperatorCode=LIVO
14912 .17% 114412 7634 http://opendata.cts-strasbourg.fr/fichiers/gtfs/google_transit.zip

I then grabbed the data from a large dataset and ran the validator locally with:

cargo run --release -- --input 12_2020_mobibreizhbret_gtfs.zip > 12_2020_mobibreizhbret_gtfs.validation.json

Finally, I filtered the JSON with jq to see which paths account for most of the payload:

jq -c 'path(..)|[.[]|tostring]|join("/")' 12_2020_mobibreizhbret_gtfs.validation.json | sed -e 's/[0-9]/X/g' | sort | uniq -c | sort -rn | grep "validations/CloseStops"
50998 "validations/CloseStops/XXX/related_objects/XX/object_type"
50998 "validations/CloseStops/XXX/related_objects/XX/name"
50998 "validations/CloseStops/XXX/related_objects/XX/id"
50998 "validations/CloseStops/XXX/related_objects/XX"
38169 "validations/CloseStops/XXX/related_objects/XXX/object_type"
38169 "validations/CloseStops/XXX/related_objects/XXX/name"
38169 "validations/CloseStops/XXX/related_objects/XXX/id"
38169 "validations/CloseStops/XXX/related_objects/XXX"
8675 "validations/CloseStops/XXX/related_objects/X/object_type"
8675 "validations/CloseStops/XXX/related_objects/X/name"
8675 "validations/CloseStops/XXX/related_objects/X/id"
8675 "validations/CloseStops/XXX/related_objects/X"
4702 "validations/CloseStops/XX/related_objects/XX/object_type"
4702 "validations/CloseStops/XX/related_objects/XX/name"
4702 "validations/CloseStops/XX/related_objects/XX/id"
4702 "validations/CloseStops/XX/related_objects/XX"
3861 "validations/CloseStops/XX/related_objects/XXX/object_type"
3861 "validations/CloseStops/XX/related_objects/XXX/name"
3861 "validations/CloseStops/XX/related_objects/XXX/id"
3861 "validations/CloseStops/XX/related_objects/XXX"
 900 "validations/CloseStops/XXX/severity"
 900 "validations/CloseStops/XXX/related_objects"
 900 "validations/CloseStops/XXX/object_type"
 900 "validations/CloseStops/XXX/object_name"
 900 "validations/CloseStops/XXX/object_id"
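For readers unfamiliar with the pipeline above, the `sed | sort | uniq -c | sort -rn` part can be sketched as follows: every digit in a path is masked with `X`, so array indices with the same number of digits fall into one bucket, and the buckets are then counted. The function names `mask_digits` and `count_masked_paths` are made up for this illustration, and breaking ties by path name is an assumption (the shell pipeline orders ties differently).

```rust
use std::collections::HashMap;

/// Replace every ASCII digit with 'X', like `sed -e 's/[0-9]/X/g'`.
fn mask_digits(path: &str) -> String {
    path.chars()
        .map(|c| if c.is_ascii_digit() { 'X' } else { c })
        .collect()
}

/// Count how many paths share the same masked form, most frequent first.
/// Ties are broken by path name (an assumption made for determinism).
fn count_masked_paths<'a, I>(paths: I) -> Vec<(String, usize)>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<String, usize> = HashMap::new();
    for path in paths {
        *counts.entry(mask_digits(path)).or_insert(0) += 1;
    }
    let mut sorted: Vec<(String, usize)> = counts.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    sorted
}
```

Read this way, a line such as `validations/CloseStops/XXX/related_objects/XX/id` counts the `id` fields of related objects at a two-digit index, inside CloseStops issues at a three-digit index.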

I have also discussed this with @antoine-de, and indeed here:

I'll resume later to provide a change here.

thbar commented 3 years ago

Another useful query, from @antoine-de:

select
    validations.resource_id,
    close_stops->>'object_id' as object_id,
    close_stops->>'details',
    json_array_length(close_stops->'related_objects') as length
from validations,
    json_array_elements(validations.details->'CloseStops') as close_stops
where validations.resource_id is not null
order by length desc;

thbar commented 3 years ago

Solved via #105 for now. This reduces the payload by a factor of 14 for the largest file and by a factor of 3 to 4 for more modest files, so it is a good improvement.