digital-services-act / transparency-database

DSA Transparency Database
GNU General Public License v2.0

`existing-puid` is using Opensearch, making it hard to use in real use cases #435

Open vmaurin opened 1 month ago

vmaurin commented 1 month ago

Looking at the source, it seems that existing-puid is backed by OpenSearch. This defeats the pattern where one checks whether a PUID exists before posting it, and it also prevents people from looking up a SoR they have just submitted.

Also, since the data does not seem to be indexed continuously, it is very difficult to check a statement of reasons that was just sent.

Would it be possible to base it on a database table instead (for example, by adding an index on the statement table to search by platform_id and puid)? I understand that there is an archiving system in place on this table, but maybe the endpoint should hit this table first, to be consistent with the store method.
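To make the proposal concrete, here is a rough sketch of a "database first, search index second" lookup, written in Python with SQLite purely for illustration. The table layout, the index name, and the `search_index_lookup` callable are all hypothetical and are not taken from the project's code.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical statements table with a composite lookup index."""
    conn.execute(
        "CREATE TABLE statements ("
        "id INTEGER PRIMARY KEY, "
        "platform_id INTEGER NOT NULL, "
        "puid TEXT NOT NULL)"
    )
    # Composite index so a (platform_id, puid) lookup avoids a full table scan.
    conn.execute(
        "CREATE INDEX idx_statements_platform_puid "
        "ON statements (platform_id, puid)"
    )


def puid_exists(conn, platform_id, puid, search_index_lookup):
    """Check the primary table first, then fall back to the search index.

    `search_index_lookup` is a hypothetical callable standing in for the
    existing search-index query; it covers rows that were archived out of
    the primary table.
    """
    row = conn.execute(
        "SELECT 1 FROM statements WHERE platform_id = ? AND puid = ? LIMIT 1",
        (platform_id, puid),
    ).fetchone()
    if row is not None:
        return True
    return search_index_lookup(platform_id, puid)
```

With this ordering, a statement that was just stored is found immediately, and the search index is consulted only for older, archived statements.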

alainvd commented 1 month ago

Dear Vincent,

Due to the statements table being excessively large with 20 billion records, the index checks were causing instability and had to be removed. We have now implemented a two-layer protection system to prevent duplicate statements, using a Redis cache and a database query in a specific table.

If you send a duplicate statement, you will receive a 422 error code indicating that this statement of reasons is already known. This should enable you to save the submission state on your side to avoid resending the same statements repeatedly.
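A minimal sketch of that client-side bookkeeping, in Python. The 422 status comes from the description above; the 201 status for a successful creation, the SQLite ledger, and every function name here are assumptions made for illustration. A 422 may also signal other validation problems, so a real client would probably inspect the response body before recording a duplicate.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Open (or create) a local ledger of PUIDs the server already knows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submitted ("
        "puid TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def already_submitted(conn, puid):
    """Return True if this PUID was recorded as known by the server."""
    row = conn.execute(
        "SELECT 1 FROM submitted WHERE puid = ?", (puid,)
    ).fetchone()
    return row is not None


def record_response(conn, puid, status_code):
    """Record the outcome of a submission attempt.

    201 (assumed success code) and 422 (duplicate, per the thread) both mean
    the server knows the statement. Any other status is left unrecorded so
    the statement can be retried later.
    """
    if status_code == 201:
        status = "created"
    elif status_code == 422:
        status = "duplicate"
    else:
        return False
    conn.execute(
        "INSERT OR REPLACE INTO submitted (puid, status) VALUES (?, ?)",
        (puid, status),
    )
    conn.commit()
    return True
```

Calling `already_submitted` before each send lets the sender skip statements the server has already accepted, without relying on the existing-puid endpoint.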

vmaurin commented 1 month ago

Hi @alainvd

Thank you for your fast response!

For sending them, we follow your guidance already, and it seems to behave well.

The issue is about `/api/v1/statement/existing-puid/<PUID>`, documented here: https://transparency.dsa.ec.europa.eu/page/api-documentation#existing-puid

The documentation states:

There is an end point that will allow you to check if a PUID value is already used.

But that is not really true, because the endpoint queries the OpenSearch index rather than the database. If it is impossible to fix the behavior for performance reasons, maybe the documentation should be updated?

Something like

There is an endpoint that allows you to get a SoR by PUID. Note that SoRs become available through this endpoint only after X hours, or after midnight on the day they were submitted.

From a user's perspective, the current documentation does not make it clear that the endpoint reads from a different "database", and it gives no clear indication of the indexing frequency or schedule.
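As an illustration of what a client has to do today, here is a hedged Python sketch that polls the endpoint until a freshly submitted PUID becomes visible. The endpoint path is the one quoted above. The base URL (assumed to be the same host as the documentation), the retry parameters, and the `lookup` callable (which should perform the authenticated HTTP request and return True when the PUID is found) are all assumptions.

```python
import time
from urllib.parse import quote

# Assumption: the API is served from the same host as the documentation.
BASE_URL = "https://transparency.dsa.ec.europa.eu"


def existing_puid_url(puid, base_url=BASE_URL):
    """Build the existing-puid URL, percent-encoding the PUID."""
    return f"{base_url}/api/v1/statement/existing-puid/{quote(puid, safe='')}"


def wait_until_visible(puid, lookup, attempts=5, delay_seconds=60.0,
                       sleep=time.sleep):
    """Poll until `lookup(url)` reports the PUID, or give up.

    `lookup` is a hypothetical callable supplied by the caller. Returns True
    as soon as the PUID is visible, False after `attempts` failed tries.
    """
    url = existing_puid_url(puid)
    for attempt in range(attempts):
        if lookup(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait before the next try
    return False
```

Without a documented indexing schedule, `attempts` and `delay_seconds` can only be guessed, which is why stating the delay in the documentation would help.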