Open anev-auror opened 8 months ago
@anev-auror it almost fixes the problem. You should also modify the FilePartitionWriter and IFilePartitionWriter to include a page number. I did this and it worked:
public async Task WritePartitionAsync(int partitionId, int page, SearchResults<SearchDocument> searchResults, CancellationToken cancellationToken)
{
    if (!Directory.Exists(_directory))
    {
        Directory.CreateDirectory(_directory);
    }
    string exportPath = Path.Combine(_directory, $"{_indexName}-{partitionId:D3}-{page:D3}-documents.json");
    using FileStream exportOutput = File.Open(exportPath, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Read);
    await foreach (Page<SearchResult<SearchDocument>> resultPage in searchResults.GetResultsAsync().AsPages())
    {
        foreach (SearchResult<SearchDocument> searchResult in resultPage.Values)
        {
            JsonSerializer.Serialize(exportOutput, searchResult.Document);
            exportOutput.WriteByte((byte)'\n');
        }
    }
}
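For readers following along, here is a minimal sketch of how a caller might drive a page-aware writer like the one above. It is not the repository's code: `PagedExportSketch`, `PageFileName`, `fetchPageAsync` and `writePageAsync` are hypothetical names introduced for illustration. `fetchPageAsync(skip, size, ...)` stands in for a search query (in the exporter this would presumably set `SearchOptions.Skip` and `SearchOptions.Size`), and `writePageAsync` stands in for `WritePartitionAsync`. The loop assumes that an empty or short page marks the end of the partition.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PagedExportSketch
{
    // Hypothetical helper: same format string as the writer above, so each
    // (partition, page) pair maps to its own file instead of overwriting one file.
    public static string PageFileName(string indexName, int partitionId, int page) =>
        $"{indexName}-{partitionId:D3}-{page:D3}-documents.json";

    // Hypothetical paging loop. Fetches pages with an increasing skip offset and
    // hands each non-empty page to the writer. Returns the number of pages written.
    public static async Task<int> ExportPartitionAsync<T>(
        int partitionId,
        int pageSize,
        Func<int, int, CancellationToken, Task<IReadOnlyList<T>>> fetchPageAsync,
        Func<int, int, IReadOnlyList<T>, CancellationToken, Task> writePageAsync,
        CancellationToken cancellationToken = default)
    {
        int page = 0;
        while (true)
        {
            cancellationToken.ThrowIfCancellationRequested();

            // Skip everything already exported on earlier pages.
            IReadOnlyList<T> documents =
                await fetchPageAsync(page * pageSize, pageSize, cancellationToken);

            // Assumption: an empty page means the partition is exhausted.
            if (documents.Count == 0)
            {
                break;
            }

            await writePageAsync(partitionId, page, documents, cancellationToken);
            page++;

            // Assumption: a short page is the last one, so no further query is needed.
            if (documents.Count < pageSize)
            {
                break;
            }
        }
        return page;
    }
}
```

With a hypothetical index named `hotels`, `PagedExportSketch.PageFileName("hotels", 2, 7)` yields `hotels-002-007-documents.json`, which is the naming scheme the modified writer produces.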
Sorry, it's not obvious how to use it. You need to invoke it from the command line like so:
dotnet run partition-index

Description:
  Partitions the data in the index between the upper and lower bound values into partitions with at
  most 100,000 documents.

Usage:
  export-data partition-index [options]

Options:
  --endpoint <endpoint> (REQUIRED)      Endpoint of the search service to export data from. Example:
                                        https://example.search.windows.net
  --admin-key <admin-key>               Admin key to the search service to export data from. If not
                                        specified - uses your Entra identity
  --index-name <index-name> (REQUIRED)  Name of the index to export data from
  --field-name <field-name> (REQUIRED)  Name of field used to partition the index data. This field must
                                        be filterable and sortable.
  --lower-bound <lower-bound>           Smallest value to use to partition the index data. Defaults to
                                        the smallest value in the index. []
  --upper-bound <upper-bound>           Largest value to use to partition the index data. Defaults to
                                        the largest value in the index. []
  --partition-size <partition-size>     Maximum size of a partition. Defaults to 100,000. Cannot exceed
                                        100,000 [default: 100000]
  --partition-path <partition-path>     Path of the file with JSON description of partitions. Should
                                        end in .json. Default is <index name>-partitions.json []
  -?, -h, --help                        Show help and usage information
Once you have partitioned the index, you can export it. Note that with the modifications you have made, the export will not work in parallel.
dotnet run export-partitions

Description:
  Exports data from a search index using a pre-generated partition file from partition-index

Usage:
  export-data export-partitions [options]

Options:
  --partition-path <partition-path> (REQUIRED)     (REQUIRED) Path of the file with JSON description of
                                                   partitions. Should end in .json.
  --admin-key <admin-key>                          Admin key to the search service to export data from.
                                                   If not specified - uses your Entra identity
  --export-path <export-path>                      Directory to write JSON Lines partition files to.
                                                   Every line in the partition file contains a JSON
                                                   object with the contents of the Search document.
                                                   Format of file names is <index name>-<partition
                                                   id>-documents.json [default: .]
  --concurrent-partitions <concurrent-partitions>  Number of partitions to concurrently export. Default
                                                   is 2 [default: 2]
  --page-size <page-size>                          Page size to use when running export queries.
                                                   Default is 1000 [default: 1000]
  --include-partition <include-partition>          List of partitions by index to include in the
                                                   export. Example: --include-partition 0
                                                   --include-partition 1 only runs the export on first
                                                   2 partitions []
  --exclude-partition <exclude-partition>          List of partitions by index to exclude from the
                                                   export. Example: --exclude-partition 0
                                                   --exclude-partition 1 runs the export on every
                                                   partition except the first 2 []
  --include-field <include-field>                  List of fields to include in the export. Example:
                                                   --include-field field1 --include-field field2. []
  --exclude-field <exclude-field>                  List of fields to exclude in the export. Example:
                                                   --exclude-field field1 --exclude-field field2. []
  -?, -h, --help                                   Show help and usage information
Unfortunately, you will have to undo your modifications; otherwise it might not work.
I also created a code sample and fixed the reported bug.
When a search contains more results than the page size, paging did not work for me in the exporter: only the documents from the first page were ever exported. I am totally new to the search client and fixed the problem by replacing the code in PartitionExporter.ExportPartitionAsync with the following. I am not sure whether this is the best approach.
Thanks for the repository, other than this problem it worked very well!