docs scraper not handling markdown with omitted optional yaml front matter

displague commented 1 year ago

What happened?

While investigating the scraper for https://github.com/crossplane-contrib/provider-jet-equinix/pull/18 and processing https://raw.githubusercontent.com/equinix/terraform-provider-equinix/master/docs/resources/equinix_ecx_l2_connection.md, I hit the following error:

scraper: error: Failed to scrape Terraform provider metadata: cannot scrape Terraform registry: failed to scrape resource metadata from path: ../.work/equinix/equinix/docs/resources/equinix_ecx_l2_connection.md: failed to find the prelude of the document using the xpath expressions: //text()[contains(., "description") and contains(., "page_title")]

According to https://developer.hashicorp.com/terraform/registry/providers/docs#yaml-frontmatter, page_title is optional.

Does scraper know how to handle the Markdown documents and formatting described here? https://developer.hashicorp.com/terraform/registry/providers/docs#format
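To illustrate what "optional page_title" means for a reader of these files, here is a minimal Go sketch. It is not the upjet scraper's code: parseFrontMatter and pageTitle are hypothetical helper names, the line-based reader is a simplification rather than a real YAML parser, and falling back to the first level-1 heading when page_title is absent is an assumption made for this example.

```go
package main

import "strings"

// parseFrontMatter is a hypothetical helper. It reads simple "key: value"
// pairs between a leading "---" line and the next "---" line.
// It is a simplified line-based reader, not a full YAML parser.
func parseFrontMatter(doc string) (fields map[string]string, body string, ok bool) {
	lines := strings.Split(doc, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		// The document does not start with front matter at all.
		return nil, doc, false
	}
	fields = map[string]string{}
	for i := 1; i < len(lines); i++ {
		line := strings.TrimSpace(lines[i])
		if line == "---" {
			// Closing delimiter: everything after it is the body.
			return fields, strings.Join(lines[i+1:], "\n"), true
		}
		key, value, found := strings.Cut(line, ":")
		if !found {
			continue // skip lines that are not simple key/value pairs
		}
		fields[strings.TrimSpace(key)] = strings.Trim(strings.TrimSpace(value), `"`)
	}
	// No closing delimiter was found, so treat the input as having no front matter.
	return nil, doc, false
}

// pageTitle returns page_title when it is set. Otherwise it falls back to the
// first level-1 Markdown heading in the body (an assumption for this sketch).
func pageTitle(fields map[string]string, body string) string {
	if t, ok := fields["page_title"]; ok && t != "" {
		return t
	}
	for _, line := range strings.Split(body, "\n") {
		if strings.HasPrefix(line, "# ") {
			return strings.TrimSpace(strings.TrimPrefix(line, "# "))
		}
	}
	return ""
}
```

With a document whose front matter has only subcategory and description, parseFrontMatter still succeeds and pageTitle takes the title from the heading instead of failing.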

How can we reproduce it?

Check out https://github.com/crossplane-contrib/provider-jet-equinix/pull/18/commits/97eb823e70fd751ff0265128a6db4dfbad9d8909

Run make generate

jaylevin commented 1 year ago

Same issue here with Confluent Kafka Provider.

scraper: error: Failed to scrape Terraform provider metadata: cannot scrape Terraform registry: failed to scrape resource metadata from path: ../.work/confluentinc/confluent/docs/resources/confluent_ksql_cluster.md: failed to find the prelude of the document using the xpath expressions: //text()[contains(., "description") and contains(., "page_title")]

ADustyOldMuffin commented 1 year ago

I'm also hitting this issue. Is there a way around it?

ulucinar commented 1 year ago

Hi @displague, could you please try overriding the default value of the --prelude-xpath command-line argument in apis/generate.go with something like:

//go:generate go run github.com/upbound/upjet/cmd/scraper -n ${TERRAFORM_PROVIDER_SOURCE} -r ../.work/${TERRAFORM_PROVIDER_SOURCE}/${TERRAFORM_DOCS_PATH} -o ../config/provider-metadata.yaml --prelude-xpath "//text()[contains(., \"subcategory\")]"
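The difference between the default expression in the error message and this override can be sketched in Go. XPath's contains(., "x") is a substring test on a text node, so strings.Contains stands in for it here. This is only an illustration: the sample front matter is hypothetical, the helper names are invented for this example, and the real scraper evaluates the XPath against the parsed document rather than a raw string.

```go
package main

import "strings"

// matchesAll mimics an XPath predicate of the form
// contains(., a) and contains(., b) ... applied to one text node.
func matchesAll(text string, needles ...string) bool {
	for _, n := range needles {
		if !strings.Contains(text, n) {
			return false
		}
	}
	return true
}

// sampleFrontMatter is hypothetical front matter text that omits the
// optional page_title key.
const sampleFrontMatter = `subcategory: "Fabric"
description: "Example resource description"`

// defaultPreludeMatches mirrors the expression from the error message:
// //text()[contains(., "description") and contains(., "page_title")]
func defaultPreludeMatches(text string) bool {
	return matchesAll(text, "description", "page_title")
}

// overridePreludeMatches mirrors the suggested override:
// //text()[contains(., "subcategory")]
func overridePreludeMatches(text string) bool {
	return matchesAll(text, "subcategory")
}
```

For sampleFrontMatter, defaultPreludeMatches is false and overridePreludeMatches is true. The override only helps when the documents actually carry a subcategory key, and because it is a plain substring test, any text node containing that word also satisfies it.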

displague commented 1 year ago

@ulucinar This seems to have unblocked the build. However, I now see new bugs in the generated code documentation: some lines are duplicated, and some lines are pulled from the wrong section of the Terraform docs.

ulucinar commented 1 year ago

Hi @displague, we recently changed the scraper to fix a case in upbound/provider-gcp. I don't expect that change to address the duplication or the wrong-section issue you mentioned above, but if you would like to give it a try, update your upjet dependency and add the optional command-line argument --resource-prefix equinix to the generate comment, like we do here. Sorry for the inconvenience. It's quite challenging for one scraper to handle all the cases.

cedws commented 1 year ago

I'm getting this issue while trying to generate from the Fastly provider. The scraper errors on empty Markdown files.

https://github.com/fastly/terraform-provider-fastly/blob/5c16f5639f7ebbb4ea637172131bfbb16453959b/docs/resources/arguments/package.md

https://github.com/fastly/terraform-provider-fastly/blob/5c16f5639f7ebbb4ea637172131bfbb16453959b/docs/resources/components/footer.md

The only workaround I have at the moment is to delete these files.

crossplane / upjet

docs scraper not handling markdown with omitted optional yaml front matter #155
