Closed antoine-de closed 3 years ago
Preliminary notes, in brain dump mode for now:
I have first looked at the transport-site database, which contains history for validations, using:
SELECT
validation_json_size,
to_char(100 * validation_json_size::float / (SUM(validation_json_size) OVER()) , '999D99%') as ratio,
subquery.id,
resource_id,
url
FROM (SELECT *, pg_column_size(details) as validation_json_size from validations) as subquery
INNER JOIN resource r on r.id = resource_id
WHERE validation_json_size IS NOT NULL
ORDER by validation_json_size desc
On a recent database, this gives (before optimisation - just the top extract):
I have then grabbed data from a large dataset:
And run the validator locally with:
cargo run --release -- --input 12_2020_mobibreizhbret_gtfs.zip > 12_2020_mobibreizhbret_gtfs.validation.json
Finally, I filtered the JSON with jq to get an idea of where the big stuff is going:
cat 12_2020_mobibreizhbret_gtfs.validation.json | jq -c 'path(..)|[.[]|tostring]|join("/")' | sed -e 's/\([0-9]\)/X/g' | sort | uniq -c | sort -rn | grep "validations/CloseStops"
50998 "validations/CloseStops/XXX/related_objects/XX/object_type"
50998 "validations/CloseStops/XXX/related_objects/XX/name"
50998 "validations/CloseStops/XXX/related_objects/XX/id"
50998 "validations/CloseStops/XXX/related_objects/XX"
38169 "validations/CloseStops/XXX/related_objects/XXX/object_type"
38169 "validations/CloseStops/XXX/related_objects/XXX/name"
38169 "validations/CloseStops/XXX/related_objects/XXX/id"
38169 "validations/CloseStops/XXX/related_objects/XXX"
8675 "validations/CloseStops/XXX/related_objects/X/object_type"
8675 "validations/CloseStops/XXX/related_objects/X/name"
8675 "validations/CloseStops/XXX/related_objects/X/id"
8675 "validations/CloseStops/XXX/related_objects/X"
4702 "validations/CloseStops/XX/related_objects/XX/object_type"
4702 "validations/CloseStops/XX/related_objects/XX/name"
4702 "validations/CloseStops/XX/related_objects/XX/id"
4702 "validations/CloseStops/XX/related_objects/XX"
3861 "validations/CloseStops/XX/related_objects/XXX/object_type"
3861 "validations/CloseStops/XX/related_objects/XXX/name"
3861 "validations/CloseStops/XX/related_objects/XXX/id"
3861 "validations/CloseStops/XX/related_objects/XXX"
900 "validations/CloseStops/XXX/severity"
900 "validations/CloseStops/XXX/related_objects"
900 "validations/CloseStops/XXX/object_type"
900 "validations/CloseStops/XXX/object_name"
900 "validations/CloseStops/XXX/object_id"
I have also discussed with @antoine-de and indeed here:
I'll resume later to provide a change here.
Also useful query by @antoine-de:
select
validations.resource_id, close_stops->>'object_id' as object_id, close_stops->>'details',
json_array_length(close_stops->'related_objects') as length from validations,
json_array_elements(validations.details->'CloseStops') as close_stops
where validations.resource_id is not null order by length desc;
Solved via #105 for now, this reduces the payload x14 for the largest file, and 3 to 4 times for more modest files, so a good improvement.
I think we should never link to several
Trip
in the related objects as this can make the number of related objects explode.For instance instead of listing the trips that make 2 stops being too close maybe we can just add the
Route
as the related object (and the number of trip ? and / or an example of a trip on this route that causes a problem ?)linked to https://github.com/etalab/transport-site/issues/1360