lintool opened this issue 4 years ago
The major challenge I found is that DynamoDB itself does not give you a friendly API for ingesting huge amounts of data. The `batch_write_item` interface limits each request to 16 MB of data and at most 25 records. MS MARCO passage has ~8M documents, which translates to roughly 350,000 round trips.
In my experiments with the ACL Anthology (a ~80 MB collection with 58,000 records), I parallelized the batch insert using a thread pool with a maximum of 5 workers, and I configured DynamoDB's write capacity to be "on-demand" so it scales up automatically. It still took ~90 seconds to complete the ingestion.
MS MARCO passage is a collection of 8,841,823 records, so even if nothing fails, the upload would take ~4 hours at that rate (58,000 records in ~90 seconds is roughly 650 records/second, and 8.8M / 650 ≈ 3.8 hours). Obviously this does not scale well to larger collections... In this case it is the 25-record cap per request, rather than the 16 MB size limit, that is capping the throughput.
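For reference, the parallelized insert is roughly the sketch below; it is a simplified version, and the table name, key schema, and chunk size are placeholders rather than what the actual script uses.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

TABLE_NAME = "msmarco-passage"  # placeholder table name

def put_chunk(chunk):
    # boto3 resources are not thread-safe, so each worker builds its own.
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    # batch_writer() buffers put_item calls into batch_write_item requests
    # (25 items max each) and automatically resubmits unprocessed items.
    with table.batch_writer() as writer:
        for docid, text in chunk:
            writer.put_item(Item={"id": docid, "contents": text})

def chunked(records, size=1000):
    # records: list of (docid, text) pairs
    for i in range(0, len(records), size):
        yield records[i:i + size]

def ingest(records, workers=5):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so exceptions from workers surface here
        list(pool.map(put_chunk, chunked(records)))
```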
One official example I found online uses S3 and Lambda: the idea is to stage the data in S3 first, then use a Lambda function to ingest it into DynamoDB. Another involves AWS Data Pipeline, which presumably uses Hadoop/Spark under the hood. Neither feels like a perfect fit to me, as we are only trying to perform a simple one-time import, while both methods involve some sort of continuous ingestion (feels like overkill).
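For what it's worth, the Lambda half of that S3 + Lambda pattern would look something like the sketch below (my reading of the example, not their exact code; the JSONL staging format, bucket layout, table and field names are all assumptions).

```python
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("msmarco-passage")  # placeholder

def handler(event, context):
    # Triggered by S3 put events; each record points at one staged JSONL file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        with table.batch_writer() as writer:
            for line in body.iter_lines():
                doc = json.loads(line)
                writer.put_item(Item={"id": doc["id"], "contents": doc["contents"]})
```

One caveat is Lambda's 15-minute execution limit, so the staged files would have to be split small enough for each invocation to finish in time.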
Is this "in theory" or have you successfully done this?
Either way, can we write some documentation on how to do this ingest?
I started looking into the AWS Data Pipeline option, and indeed concluded it to be overkill.
I have not done this yet. It requires some changes to the batch-insert Python scripts to accommodate the MS MARCO passage collection. I will write documentation about it once I figure out a proper way to ingest.
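The collection-reading part should be straightforward, though; a minimal sketch, assuming the standard collection.tsv layout of one `pid<TAB>passage` pair per line:

```python
def read_collection(path="collection.tsv"):
    # MS MARCO passage ships as a TSV with one "pid<TAB>passage" pair per line.
    with open(path, encoding="utf-8") as f:
        for line in f:
            pid, passage = line.rstrip("\n").split("\t", 1)
            yield pid, passage
```

The resulting (pid, passage) pairs could then be chunked and fed to the same parallelized batch insert as above.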
Let's see if we can replicate serverless MS MARCO passage? The tricky bit here is ingesting the documents into DynamoDB in a reasonable amount of time and in a robust manner...