Azure-Samples / azure-search-dotnet-utilities

C# code samples that help with admin or development tasks in Azure Cognitive Search.
MIT License

Paging does not work for the export #10

Open anev-auror opened 8 months ago

anev-auror commented 8 months ago

When a search returns more results than the page size, paging did not work for me in the exporter: only the documents from the first page were ever exported. I am totally new to the search client and fixed the problem by replacing the code in PartitionExporter.ExportPartitionAsync with the following. I am not sure whether this is the best approach.

Thanks for the repository, other than this problem it worked very well!

        var options = new SearchOptions
        {
            Filter = partition.Filter,
            Size = _pageSize,
            Skip = 0,
            IncludeTotalCount = true,
        };
        AddSelect(options);
        options.OrderBy.Add($"{_partitionFile.FieldName} asc");

        SearchResults<SearchDocument> searchResults = null;
        do
        {
            searchResults = await _searchClient.SearchAsync<SearchDocument>(
                searchText: string.Empty,
                options: options,
                cancellationToken: cancellationToken);

            await _partitionWriter.WritePartitionAsync(partitionId, searchResults, cancellationToken);
            options.Skip += _pageSize;
        } while (options.Skip < searchResults.TotalCount);
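
For readers who want to try the loop shape without a search service, here is a minimal sketch of the same Skip/Size paging pattern. `FetchPageAsync` is a hypothetical in-memory stand-in for one search request and is not part of the Azure SDK or of this repository. It only mimics what a query with Size, Skip and IncludeTotalCount returns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class SkipPagingSketch
{
    // Hypothetical stand-in for one search request: returns the items in
    // [skip, skip + size) together with the total number of matches.
    public static Task<(IReadOnlyList<int> Page, long TotalCount)> FetchPageAsync(
        IReadOnlyList<int> source, int skip, int size)
    {
        IReadOnlyList<int> page = source.Skip(skip).Take(size).ToList();
        return Task.FromResult((page, (long)source.Count));
    }

    // Requests one page at a time and advances skip by the page size until
    // skip has moved past the total count reported by the last request.
    public static async Task<List<int>> ExportAllAsync(IReadOnlyList<int> source, int pageSize)
    {
        if (pageSize <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(pageSize));
        }

        var exported = new List<int>();
        int skip = 0;
        long totalCount;
        do
        {
            var (page, total) = await FetchPageAsync(source, skip, pageSize);
            exported.AddRange(page);
            totalCount = total;
            skip += pageSize;
        } while (skip < totalCount);

        return exported;
    }
}
```

With 2,500 items and a page size of 1,000, the loop issues three requests (skip 0, 1000 and 2000) and collects every item. As far as I know, the service rejects Skip values above 100,000, so this pattern only covers a partition that stays within that limit.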

brunodoamaral commented 6 months ago

@anev-auror it almost fixes the problem. You should also modify FilePartitionWriter and IFilePartitionWriter to include a page number; otherwise every page is written to the same file. I did this and it worked:

        public async Task WritePartitionAsync(int partitionId, int page, SearchResults<SearchDocument> searchResults, CancellationToken cancellationToken)
        {
            if (!Directory.Exists(_directory))
            {
                Directory.CreateDirectory(_directory);
            }

            string exportPath = Path.Combine(_directory, $"{_indexName}-{partitionId:D3}-{page:D3}-documents.json");
            using FileStream exportOutput = File.Open(exportPath, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Read);

            await foreach (Page<SearchResult<SearchDocument>> resultPage in searchResults.GetResultsAsync().AsPages())
            {
                foreach (SearchResult<SearchDocument> searchResult in resultPage.Values)
                {
                    JsonSerializer.Serialize(exportOutput, searchResult.Document);
                    exportOutput.WriteByte((byte)'\n');
                }
            }
        }
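
To make the per-page file naming easy to try in isolation, here is a minimal sketch that writes one JSON Lines file per page using only System.Text.Json. The class name, `GetExportPath` and `WritePageAsync` are hypothetical helpers for illustration, and plain dictionaries stand in for SearchDocument. The sketch uses FileMode.Create instead of FileMode.OpenOrCreate so that a re-run truncates an existing file. That is a design choice of this sketch, not something taken from the repository.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class PagedFileWriterSketch
{
    // Builds <index name>-<partition id>-<page>-documents.json, with the
    // partition id and page number zero-padded to three digits.
    public static string GetExportPath(string directory, string indexName, int partitionId, int page) =>
        Path.Combine(directory, $"{indexName}-{partitionId:D3}-{page:D3}-documents.json");

    // Writes each document as one JSON object per line (JSON Lines).
    public static async Task WritePageAsync(
        string directory,
        string indexName,
        int partitionId,
        int page,
        IEnumerable<IDictionary<string, object>> documents,
        CancellationToken cancellationToken = default)
    {
        // CreateDirectory does nothing if the directory already exists.
        Directory.CreateDirectory(directory);

        string exportPath = GetExportPath(directory, indexName, partitionId, page);

        // FileMode.Create truncates an existing file, so no stale bytes
        // from an earlier, longer export are left at the end.
        await using FileStream output = File.Open(exportPath, FileMode.Create, FileAccess.Write, FileShare.Read);

        foreach (IDictionary<string, object> document in documents)
        {
            cancellationToken.ThrowIfCancellationRequested();
            await JsonSerializer.SerializeAsync(output, document, cancellationToken: cancellationToken);
            output.WriteByte((byte)'\n');
        }
    }
}
```

Because the page number is part of the file name, consecutive pages of the same partition no longer overwrite each other.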

mattgotteiner commented 6 months ago

Sorry, it's not obvious how to use it. You need to invoke it from the command line like so:

 dotnet run partition-index

Description:
  Partitions the data in the index between the upper and lower bound values into partitions with at
  most 100,000 documents.

Usage:
  export-data partition-index [options]

Options:
  --endpoint <endpoint> (REQUIRED)      Endpoint of the search service to export data from. Example:
                                        https://example.search.windows.net
  --admin-key <admin-key>               Admin key to the search service to export data from. If not
                                        specified - uses your Entra identity
  --index-name <index-name> (REQUIRED)  Name of the index to export data from
  --field-name <field-name> (REQUIRED)  Name of field used to partition the index data. This field must
                                        be filterable and sortable.
  --lower-bound <lower-bound>           Smallest value to use to partition the index data. Defaults to
                                        the smallest value in the index. []
  --upper-bound <upper-bound>           Largest value to use to partition the index data. Defaults to
                                        the largest value in the index. []
  --partition-size <partition-size>     Maximum size of a partition. Defaults to 100,000. Cannot exceed
                                        100,000 [default: 100000]
  --partition-path <partition-path>     Path of the file with JSON description of partitions. Should
                                        end in .json. Default is <index name>-partitions.json []
  -?, -h, --help                        Show help and usage information

Once you have partitioned the index, you can export it. With the modifications you have made, the export will not work in parallel.

 dotnet run export-partitions

Description:
  Exports data from a search index using a pre-generated partition file from partition-index

Usage:
  export-data export-partitions [options]

Options:
  --partition-path <partition-path> (REQUIRED)     Path of the file with JSON description of
                                                   partitions. Should end in .json.
  --admin-key <admin-key>                          Admin key to the search service to export data from.
                                                   If not specified - uses your Entra identity
  --export-path <export-path>                      Directory to write JSON Lines partition files to.
                                                   Every line in the partition file contains a JSON
                                                   object with the contents of the Search document.
                                                   Format of file names is <index name>-<partition
                                                   id>-documents.json [default: .]
  --concurrent-partitions <concurrent-partitions>  Number of partitions to concurrently export. Default
                                                   is 2 [default: 2]
  --page-size <page-size>                          Page size to use when running export queries.
                                                   Default is 1000 [default: 1000]
  --include-partition <include-partition>          List of partitions by index to include in the
                                                   export. Example: --include-partition 0
                                                   --include-partition 1 only runs the export on first
                                                   2 partitions []
  --exclude-partition <exclude-partition>          List of partitions by index to exclude from the
                                                   export. Example: --exclude-partition 0
                                                   --exclude-partition 1 runs the export on every
                                                   partition except the first 2 []
  --include-field <include-field>                  List of fields to include in the export. Example:
                                                   --include-field field1 --include-field field2. []
  --exclude-field <exclude-field>                  List of fields to exclude in the export. Example:
                                                   --exclude-field field1 --exclude-field field2. []
  -?, -h, --help                                   Show help and usage information

Unfortunately, you will have to undo your modifications; otherwise it might not work.
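
To illustrate what exporting several partitions at once with an upper bound such as --concurrent-partitions could look like, here is a minimal sketch using SemaphoreSlim. This is an assumption about the general technique, not the exporter's actual implementation. `ExportPartitionsAsync` and the `exportPartitionAsync` callback are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrentExportSketch
{
    // Runs the export callback for every partition id, allowing at most
    // concurrentPartitions callbacks to be in flight at the same time.
    public static async Task ExportPartitionsAsync(
        IEnumerable<int> partitionIds,
        int concurrentPartitions,
        Func<int, CancellationToken, Task> exportPartitionAsync,
        CancellationToken cancellationToken = default)
    {
        if (concurrentPartitions < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(concurrentPartitions));
        }

        using var gate = new SemaphoreSlim(concurrentPartitions);

        // ToList starts all tasks; each one waits at the gate for a free slot.
        List<Task> tasks = partitionIds.Select(async partitionId =>
        {
            await gate.WaitAsync(cancellationToken);
            try
            {
                await exportPartitionAsync(partitionId, cancellationToken);
            }
            finally
            {
                gate.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
    }
}
```

Each partition is independent in this sketch, so the callback for one partition can page through its own results sequentially while other partitions run alongside it.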

mattgotteiner commented 6 months ago

I also created a code sample and fixed the reported bug.
