adsabs / ADSDocMatchPipeline

Pipeline to match publisher documents with their preprint counterparts and vice versa
MIT License

Sending updates to Oracle #40

Open golnazads opened 1 month ago

golnazads commented 1 month ago

The docmatching pipeline sends requests to the Oracle service to insert data into a database. I have learned that these requests sometimes contain duplicate entries. Since the microservice uses bulk insert operations, duplicates in the list cause the insert to fail.

I propose to filter out duplicate entries at the pipeline level, before sending the data to Oracle. My main concern is the service's 60-second time limit: if the operation exceeds this limit, the service would continue to update the database, but from the pipeline's perspective the request is considered to have failed.

By handling duplicates at the pipeline level, we reduce the processing load on the microservice and minimize data transfer, which is crucial given the AWS setup with multiple layers (NGINX, Kubernetes). Additionally, processing the data at the pipeline level ensures that only clean, properly structured data reaches the microservice, preventing bulk insert failures. This approach also aligns with the principle that data preparation should occur before reaching the service layer, leaving the service layer responsible solely for inserting data into the database as requested. Furthermore, it's easier to implement robust error handling and logging at the pipeline level, allowing for better troubleshooting and data quality control.
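
Something like this minimal sketch is what I have in mind for the deduplication step; the `dedupe_updates` name and the bibcode field names are illustrative, not the pipeline's actual schema:

```python
# Minimal sketch of pipeline-side deduplication, assuming the updates are
# dicts keyed by a pair of bibcodes; field names are illustrative only.

def dedupe_updates(updates, key_fields=("source_bibcode", "matched_bibcode")):
    """Keep the first occurrence of each key, preserving order."""
    seen = set()
    unique = []
    for update in updates:
        key = tuple(update.get(field) for field in key_fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(update)
    return unique
```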

I've also recognized that the volume of data sent to the service can grow, and I'm proposing to send the updates in batches. This approach is similar to the process already in place between the resolver service and master pipeline. The resolver service does not accept any updates exceeding 1000 rows (configurable parameter), so the master pipeline always breaks down the updates into batches of 1000 rows or fewer before sending them to the resolver for database insertion.
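
As a rough sketch of the batching idea (the `MAX_ROWS_PER_REQUEST` constant and `send_to_oracle()` call are hypothetical names, not existing code):

```python
# Minimal sketch of batching, mirroring the resolver/master-pipeline pattern;
# the constant and the sender function are assumed names for illustration.

MAX_ROWS_PER_REQUEST = 1000  # configurable upper bound per request

def iter_batches(rows, batch_size=MAX_ROWS_PER_REQUEST):
    """Yield successive slices of `rows`, each with at most `batch_size` entries."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Hypothetical usage:
# for batch in iter_batches(dedupe_updates(updates)):
#     send_to_oracle(batch)
```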

aaccomazzi commented 1 month ago

Sounds like a good plan to me. Any objections @ehenneken @seasidesparrow ?

golnazads commented 3 weeks ago

For your information, I added an issue to limit the number of updates Oracle accepts, hopefully to be worked on in the next sprint: https://github.com/adsabs/oracle_service/issues/74

ehenneken commented 3 weeks ago

@aaccomazzi @golnazads @seasidesparrow I know too little about the implementation details to give much feedback, but it sounds good to me. The observation "if the operation exceeds this limit, the service would continue to update the database, but from the pipeline's perspective the request is considered to have failed" made me wonder whether the pipeline captures and reports this in a way that makes it clear what happened.

golnazads commented 3 weeks ago

No, Steve found the situation very confusing. The issue was that if a command took too long to reach the service, it would time out, causing the layers to return an error to the pipeline, not the service. Meanwhile, the service continued processing the request, updating the database, which required locking the table. As a result, any subsequent calls to the service were rejected because the table was still locked, preventing any updates.

In the logs, you would see a series of 503 errors followed by one or more 200 success codes. Our understanding was that the first 503 and the first 200 after the errors were related to the same call: the error came from the service layers, and the success came from the service itself. In the end, it was unclear which calls had succeeded and which had failed. The log only indicated success or failure and the length of the request.

So, Steve would resend all the rows to the service for updating. The peculiar thing was that sometimes everything would be fine on the next attempt, but other times the next attempt would still result in errors. He experimented with the number of rows to send and found that 1,000 rows was the right number. It has been a few years now, and no error, knock on wood, lol.

seasidesparrow commented 6 hours ago

Regarding updates to oracle from curated files, the current release of the code already does this; see https://github.com/adsabs/ADSDocMatchPipeline/blob/093ce198b707ce305b2f2deca85b7a7b0003af90/adsdocmatch/oracle_util.py#L624

It loads the entire file, checks for duplicates and moves them to a retry array, sends the non-duplicates to oracle, and then repeats the process with the contents of the retry array. Along with that, I also implemented a loop that sends rows one at a time, which eliminates issues with problematic records (esoar, EaArX) failing an entire block in a bulk insert. This results in substantially more calls to the oracle service when uploading large curated files, but we do not do that on a regular basis anymore.
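
For anyone following along, a rough sketch of that flow (the function names and arguments are illustrative, not the actual oracle_util.py implementation):

```python
# Illustrative sketch only: rows whose key repeats within a pass are deferred
# to a retry list and sent in a later pass; the one-row-at-a-time fallback
# keeps a single bad record from failing a whole bulk insert.

def send_with_retry(rows, send_func, key_func):
    """Bulk-send unique rows; duplicates found in a pass go to the next pass."""
    while rows:
        seen, unique, retry = set(), [], []
        for row in rows:
            key = key_func(row)
            if key in seen:
                retry.append(row)
            else:
                seen.add(key)
                unique.append(row)
        send_func(unique)  # one bulk call to the oracle service per pass
        rows = retry

def send_one_at_a_time(rows, send_func):
    """Fallback: one row per request, so a problematic record only fails itself."""
    for row in rows:
        send_func([row])
```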