apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

Cannot query sitemap.xml files #2899

Closed blackerby closed 5 months ago

blackerby commented 5 months ago

Describe the bug: Querying a sitemap file "succeeds", but the web UI says no data is available.

Steps to reproduce:

  1. Create a local copy of https://www.govinfo.gov/sitemap/bulkdata/PLAW/118publ/sitemap.xml on your hard drive.
  2. Query the file:
    select * from dfs.`<path_to_file>/sitemap.xml`;
  3. See the error in the screenshot below.

Expected behavior: I expect to see data in the output table.

Error detail, log output or screenshots: [Screenshot 2024-04-10 at 8 50 00 PM]

Drill version: 1.21.1

(cc: @cgivre)

blackerby commented 5 months ago

I see a similar issue when I try to query this MODS file:

https://www.govinfo.gov/metadata/pkg/PLAW-106publ246/mods.xml

blackerby commented 5 months ago

Removing all attributes from the root elements of the sitemap and MODS files results in data being returned.
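For anyone reproducing this workaround without hand-editing the files, here is a minimal Python sketch using only the standard library. It assumes that "attributes" includes the namespace declarations (`xmlns=...`) on the root element, which the comment above does not spell out. The helper name `strip_root_attributes` is hypothetical, and any sample input is up to the reader; nothing here is taken from Drill itself.

```python
import xml.etree.ElementTree as ET


def strip_root_attributes(xml_text: str) -> str:
    """Return the document with a bare root element.

    Assumption: namespace declarations count as root attributes, so
    namespace qualifiers are also dropped from every element tag.
    """
    root = ET.fromstring(xml_text)

    # Remove ordinary attributes on the root (e.g. xsi:schemaLocation).
    root.attrib.clear()

    # ElementTree stores a namespaced tag as "{uri}name"; keep only "name"
    # so that no xmlns declaration is written back out on serialization.
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]

    return ET.tostring(root, encoding="unicode")
```

Saving the returned string to a new file gives a copy with a bare root element, which is the shape of file this workaround describes. Attributes on child elements are left untouched.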

Copy/paste isn't really cooperating with my MODS example, but here is the output after removing attributes from the root element of the sitemap file.

apache drill> select * from  dfs.`/Users/wm/Desktop/sitemap.xml` t;
+------------+----------------------------------------------------------------------------------+
| attributes |                                       url                                        |
+------------+----------------------------------------------------------------------------------+
| {}         | {"loc":"https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ1.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ2.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ3.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ4.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ5.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ6.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ8.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ7.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ9.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ11.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ12.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ10.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ13.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ14.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ16.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ18.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ19.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ17.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ15.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ21.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ20.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ22.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ23.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ30.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ28.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ29.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ26.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ25.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ24.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/p
ublic/PLAW-118publ27.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ34.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ33.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ32.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ35.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ037.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ36.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ37.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ38.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ39.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ31.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ40.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ41.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ45.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ44.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ46.xmlhttps://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ43.xml","lastmod":"2024-03-28T00:10:00.074Z2023-06-20T23:44:00.215Z2023-07-03T14:32:01.529Z2023-08-29T19:17:07.501Z2023-12-11T20:04:05.659Z2023-08-29T19:18:04.606Z2023-08-29T19:45:02.701Z2023-12-11T19:11:06.925Z2023-12-11T19:12:10.757Z2024-02-15T16:27:03.296Z2023-12-11T19:47:04.743Z2024-02-15T16:27:03.236Z2024-02-15T16:27:03.265Z2024-02-15T15:44:41.386Z2023-12-11T20:04:05.922Z2023-12-11T20:05:08.683Z2023-12-12T00:49:02.710Z2024-02-15T15:44:41.414Z2024-02-15T15:44:41.443Z2024-02-15T15:44:41.471Z2023-11-21T16:53:01.237Z2023-12-07T23:10:02.343Z2024-01-04T16:42:00.154Z2024-01-04T16:42:00.247Z2024-01-04T16:42:00.275Z2024-01-04T16:42:00.340Z2024-01-04T16:42:00.369Z2024-01-04T16:42:00.397Z2024-01-09T15:48:00.645Z2024-02-14T00:54:00.081Z2024-02-07T19:55:17.124Z2024-01-22T13:57:00.039Z2024-01-22T19:17:00.607Z2024-01-24T22:17:22.372Z2024-02-02T13:21:00.064Z2024-02-02T13:21:00.142Z2024-02-12T20:18:53.802Z20
24-02-15T13:17:00.671Z2024-02-21T13:38:00.160Z2024-03-14T16:09:00.050Z2024-03-11T12:59:00.649Z2024-03-15T12:16:00.039Z2024-03-29T12:26:00.117Z2024-03-29T12:26:00.173Z2024-03-29T12:26:00.209Z2024-04-01T12:39:00.068Z","changefreq":"monthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthlymonthly","priority":"1.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.01.0"} |

cgivre commented 5 months ago

@blackerby I found the cause of this issue and submitted a fix. Would you mind testing it out and submitting a review?

blackerby commented 5 months ago

@cgivre Sure thing, I should be able to do so in the next day or two.

blackerby commented 5 months ago

Alright, got your fork up and running on my machine. Borrowing a little syntax from the added test (I'm still learning Drill), this worked perfectly:

apache drill> select * from table(dfs.`/Users/wm/Desktop/sitemap.xml` (type => 'xml', dataLevel => 2)) limit 5;
+------------+--------------------------------------------------------------------+--------------------------+------------+----------+
| attributes |                                loc                                 |         lastmod          | changefreq | priority |
+------------+--------------------------------------------------------------------+--------------------------+------------+----------+
| {}         | https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ1.xml | 2024-03-28T00:10:00.074Z | monthly    | 1.0      |
| {}         | https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ2.xml | 2023-06-20T23:44:00.215Z | monthly    | 1.0      |
| {}         | https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ3.xml | 2023-07-03T14:32:01.529Z | monthly    | 1.0      |
| {}         | https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ4.xml | 2023-08-29T19:17:07.501Z | monthly    | 1.0      |
| {}         | https://www.govinfo.gov/bulkdata/PLAW/118/public/PLAW-118publ5.xml | 2023-12-11T20:04:05.659Z | monthly    | 1.0      |
+------------+--------------------------------------------------------------------+--------------------------+------------+----------+
5 rows selected (0.151 seconds)
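The difference between this result and the single concatenated row earlier in the thread comes down to which nesting level is treated as a row. The sketch below is a conceptual illustration in Python, not Drill's actual XML reader. It assumes the root element is level 1, which is consistent with the outputs in this thread: the default query returns one row with a `url` column, while `dataLevel => 2` returns one row per `<url>` element. The function name `rows_at_level` is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop a "{namespace}" prefix from an ElementTree tag, if present."""
    return tag.split("}", 1)[-1]


def rows_at_level(xml_text: str, data_level: int) -> list[dict[str, str]]:
    """Treat every element at depth `data_level` as one row.

    Assumption: the root element is level 1. Each row maps the element's
    direct children (by local tag name) to their text content.
    """
    if data_level < 1:
        raise ValueError("data_level must be >= 1")

    level = [ET.fromstring(xml_text)]  # level 1: just the root
    for _ in range(data_level - 1):
        # Descend one level: collect the direct children of every element.
        level = [child for elem in level for child in elem]

    return [
        {local_name(child.tag): (child.text or "").strip() for child in elem}
        for elem in level
    ]
```

With a sitemap-shaped document, `rows_at_level(text, 2)` yields one dictionary per `<url>` with `loc`, `lastmod`, and similar keys, mirroring the columns in the output above. At level 1 there is only the root to work with, so all `<url>` children compete for a single row.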

No such luck with a MODS file, though:

apache drill> select * from dfs.`/Users/wm/Desktop/mods.xml`;
+------------+------+----------------+-------+----------+-----------+-----------+------------+---------------------+----------------+------------+------------+-----------------+-----------------+--------------------+---------------+---------+---------+-------+----------+
| attributes | name | typeOfResource | genre | language | extension | titleInfo | originInfo | physicalDescription | classification | identifier | recordInfo | accessCondition | isAppropriation | legislativeHistory | congCommittee | section | chapter | pages | location |
+------------+------+----------------+-------+----------+-----------+-----------+------------+---------------------+----------------+------------+------------+-----------------+-----------------+--------------------+---------------+---------+---------+-------+----------+
No rows selected (0.147 seconds)

Happy to add this as a comment on #2908, but I wanted to comment here first in case there's something I'm missing as a brand-new Drill user.