aws-cloudformation / aws-cloudformation-resource-provider-qbusiness

CloudFormation resource schemas and handlers for the resources of Amazon Q Business.
https://docs.aws.amazon.com/amazonq/latest/business-use-dg/what-is.html
Apache License 2.0

WebCrawler data source cannot be created #54

Closed mptak-tbscg closed 2 months ago

mptak-tbscg commented 3 months ago

Hello, I encountered an issue while trying to deploy a new data source to my QBusiness application. Due to the lack of complete documentation, I created a resource via the console and retrieved all details about it using the following command:

aws qbusiness get-data-source --application-id <app-id> --index-id <index-id> --data-source-id <data-source-id>

Here is the output of the command:

applicationId: <app-id>
configuration:
  additionalProperties:
    crawlAllDomain: false
    crawlAttachments: false
    crawlDepth: '2'
    crawlDomainsOnly: false
    crawlSubDomain: true
    exclusionFileIndexPatterns: []
    exclusionURLCrawlPatterns: []
    exclusionURLIndexPatterns: []
    honorRobots: true
    includeSupportedFileType: false
    inclusionFileIndexPatterns: []
    inclusionURLCrawlPatterns: []
    inclusionURLIndexPatterns: []
    maxFileSize: '50'
    maxLinksPerUrl: '100'
    proxy: {}
    rateLimit: '300'
  connectionConfiguration:
    repositoryEndpointMetadata:
      authentication: NoAuthentication
      seedUrlConnections:
      - seedUrl: <seed-url>
  enableIdentityCrawler: false
  repositoryConfigurations:
    attachment:
      fieldMappings:
      - dataSourceFieldName: category
        indexFieldName: _category
        indexFieldType: STRING
      - dataSourceFieldName: sourceUrl
        indexFieldName: _source_uri
        indexFieldType: STRING
    webPage:
      fieldMappings:
      - dataSourceFieldName: category
        indexFieldName: _category
        indexFieldType: STRING
      - dataSourceFieldName: sourceUrl
        indexFieldName: _source_uri
        indexFieldType: STRING
  syncMode: FORCED_FULL_CRAWL
  type: WEBCRAWLER
  version: 1.0.0
createdAt: '2024-06-10T12:34:48.898000+02:00'
dataSourceArn: <arn>
dataSourceId: <id>
description: ''
displayName: test
error: {}
indexId: <index-id>
roleArn: <role-arn>
status: ACTIVE
syncSchedule: ''
type: WEBCRAWLER
updatedAt: '2024-06-10T12:34:48.898000+02:00'

I tried to use the output to create a YAML template, so I copied the configuration section and created something like this:

QChatWebDataSource:
    Type: AWS::QBusiness::DataSource
    Properties:
      ApplicationId: !Ref QChatBusinessApp
      Configuration:
        additionalProperties:
          crawlDepth: 2
          maxFileSize: 50
          maxLinksPerUrl: 100
          rateLimit: 300
          proxy: {}
          exclusionURLCrawlPatterns: []
          exclusionURLIndexPatterns: []
          exclusionFileIndexPatterns: []
          inclusionURLCrawlPatterns: []
          inclusionURLIndexPatterns: []
          inclusionFileIndexPatterns: []
          includeSupportedFileType: false
          crawlSubDomain: true
          crawlAllDomain: false
          honorRobots: true
        connectionConfiguration:
          repositoryEndpointMetadata:
            authentication: NoAuthentication
            seedUrlConnections:
              - seedUrl: <url>
        enableIdentityCrawler: false
        repositoryConfigurations:
          attachment:
            fieldMappings:
              - dataSourceFieldName: category
                indexFieldName: _category
                indexFieldType: STRING
              - dataSourceFieldName: sourceUrl
                indexFieldName: _source_uri
                indexFieldType: STRING
          webPage:
            fieldMappings:
              - dataSourceFieldName: category
                indexFieldName: _category
                indexFieldType: STRING
              - dataSourceFieldName: sourceUrl
                indexFieldName: _source_uri
                indexFieldType: STRING
        syncMode: FULL_CRAWL
        type: WEBCRAWLER
        version: 1.0.0
      Description: Website indexer
      DisplayName: !Sub "${SolutionPrefix}-web-ds"
      IndexId: !GetAtt QChatBusinessNativeIndex.IndexId
      RoleArn: !GetAtt QChatWebDataSourceRole.Arn
      SyncSchedule: "cron(33 22 ? * SAT *)"

I am not able to deploy it using the SAM framework.

[image: screenshot of the deployment error]

Despite using plain booleans, or even conditions like !Condition [1, 1], I still encounter the same issue. After some research, I found this file:

https://github.com/aws-cloudformation/aws-cloudformation-resource-provider-qbusiness/blob/main/aws-qbusiness-datasource/src/main/java/software/amazon/qbusiness/datasource/translators/DocumentConverter.java

The method convertToMapToDocument converts the configuration into a Document using an ImmutableMap. My guess is that in doing so everything is mapped to strings, which is why the errors appear. Correct me if I'm wrong. Thank you in advance.

jregistr commented 2 months ago

The method convertToMapToDocument is used to convert the configuration into a Document by using ImmutableMap. My guess is that by doing so, we are probably mapping everything to strings, which is why the errors appear. Correct me if I'm wrong. Thank you in advance.

Hi. We've investigated this, and your hunch is heading in the right direction. We consulted with the CloudFormation team and found that boolean values are received by handlers as strings in free-form JSON objects like our Configuration property. This is due to how YAML and JSON are handled.

We've fixed this error by adding additional logic to parse boolean strings. See the PR: https://github.com/aws-cloudformation/aws-cloudformation-resource-provider-qbusiness/pull/56
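The idea behind the fix can be sketched roughly as follows: when walking the free-form Configuration map, values that arrive as the strings "true"/"false" are parsed back into booleans before being handed to the service. This is a minimal illustration only; the class and method names are hypothetical and do not reflect the actual code in DocumentConverter.java or PR #56.

```java
import java.util.Map;

public class BooleanAwareConverter {

    // Hypothetical helper: normalize a raw value from the CloudFormation
    // configuration map, turning "true"/"false" strings back into booleans.
    // Any other value passes through unchanged.
    static Object normalize(Object value) {
        if (value instanceof String s) {
            if (s.equalsIgnoreCase("true")) {
                return Boolean.TRUE;
            }
            if (s.equalsIgnoreCase("false")) {
                return Boolean.FALSE;
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Booleans (and numbers) arrive from CFN as strings in the free-form map.
        Map<String, Object> raw = Map.of(
                "crawlSubDomain", "true",
                "crawlAllDomain", "false",
                "crawlDepth", "2");
        raw.forEach((k, v) -> System.out.println(k + " -> " + normalize(v)));
    }
}
```

A recursive version of the same check, applied while building the Document, would cover nested maps and lists like repositoryConfigurations.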

jregistr commented 2 months ago

I copied and slightly modified your YAML template and confirmed it can be created successfully:

Resources:
  QChatWebDataSource:
      Type: AWS::QBusiness::DataSource
      Properties:
        ApplicationId: !ImportValue TheApplicationID
        Configuration:
          additionalProperties:
            crawlDepth: 2
            maxFileSize: 50
            maxLinksPerUrl: 100
            rateLimit: 300
            proxy: {}
            exclusionURLCrawlPatterns: []
            exclusionURLIndexPatterns: []
            exclusionFileIndexPatterns: []
            inclusionURLCrawlPatterns: []
            inclusionURLIndexPatterns: []
            inclusionFileIndexPatterns: []
            includeSupportedFileType: false
            crawlSubDomain: true
            crawlAllDomain: false
            honorRobots: true
          connectionConfiguration:
            repositoryEndpointMetadata:
              authentication: NoAuthentication
              seedUrlConnections:
                - seedUrl: "https://en.wikipedia.org/wiki/Dijkstra's_algorithm"
          enableIdentityCrawler: false
          repositoryConfigurations:
            attachment:
              fieldMappings:
                - dataSourceFieldName: category
                  indexFieldName: _category
                  indexFieldType: STRING
                - dataSourceFieldName: sourceUrl
                  indexFieldName: _source_uri
                  indexFieldType: STRING
            webPage:
              fieldMappings:
                - dataSourceFieldName: category
                  indexFieldName: _category
                  indexFieldType: STRING
                - dataSourceFieldName: sourceUrl
                  indexFieldName: _source_uri
                  indexFieldType: STRING
          syncMode: FULL_CRAWL
          type: WEBCRAWLER
          version: 1.0.0
        Description: Website indexer
        DisplayName: "I-web-ds"
        IndexId: !ImportValue TheIndexId
        RoleArn: !ImportValue TheDataSourceRoleArn
        SyncSchedule: "cron(33 22 ? * SAT *)"

jregistr commented 2 months ago

@mptak-tbscg I merged the fix for this issue. Please feel free to re-open if you continue to experience the same problem.