catalyst-cooperative / pudl-archiver

A tool for capturing snapshots of public data sources and archiving them on Zenodo for programmatic use.
MIT License

Duplicate filename warnings while archiving FERC XBRL data #311

Open zaneselvans opened 6 months ago

zaneselvans commented 6 months ago

I tried running the FERC Form 1 archiver locally and saw a number of warnings about duplicate filenames in zipfiles. E.g.

UserWarning: Duplicate name: 'System_Energy_Resources,_Inc._form1_Q4_1702681857.xbrl'
UserWarning: Duplicate name: 'System_Energy_Resources,_Inc._form1_Q4_1702684358.xbrl'
UserWarning: Duplicate name: 'System_Energy_Resources,_Inc._form1_Q4_1702685900.xbrl'
UserWarning: Duplicate name: 'NextEra_Energy_Transmission_New_York,_Inc._form1_Q4_1708143331.xbrl'
UserWarning: Duplicate name: 'NorthWestern_Corporation_form1_Q4_1709182516.xbrl'

Do we expect there to be filename collisions?
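
For context, these warnings come from Python's standard `zipfile` module, which warns rather than fails when a member name is written twice. Below is a minimal sketch that reproduces the behavior with an in-memory archive. The helper name `duplicate_name_warnings` and the sample filename are made up for illustration and are not part of the archiver.

```python
import io
import warnings
import zipfile


def duplicate_name_warnings(names):
    """Write empty members to an in-memory zip and return the warning messages raised."""
    buffer = io.BytesIO()
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones that would normally be shown only once.
        warnings.simplefilter("always")
        with zipfile.ZipFile(buffer, mode="w") as archive:
            for name in names:
                archive.writestr(name, b"")
    return [str(w.message) for w in caught]


# Hypothetical filename, written twice to trigger the warning.
messages = duplicate_name_warnings(
    ["Example_Utility_form1_Q4_1.xbrl", "Example_Utility_form1_Q4_1.xbrl"]
)
for message in messages:
    print(message)
```

Note that the archive ends up with both entries under the same name, so the duplicate is stored rather than dropped.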

zschira commented 5 months ago

I noticed this while working on the FERC archivers recently. The main RSS feed contains only the most recent filings, while older filings can only be found in month-specific feeds. This leads to collisions when a recent filing appears in both a month-specific feed and the main feed. The two copies should be the exact same filing, so it shouldn't really be a problem, but I think it would be best to fix this and raise an error if we see unexpected duplicates.
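
One possible shape for that fix, as a rough sketch only: merge the filings from all feeds by filename, silently skip a repeat whose content is byte-for-byte identical, and raise if the same filename shows up with different content. The names `merge_feed_filings` and `UnexpectedDuplicateError`, and the `(filename, content)` tuple layout, are assumptions for illustration and do not reflect the archiver's actual code.

```python
import hashlib


class UnexpectedDuplicateError(ValueError):
    """Raised when two filings share a filename but differ in content."""


def merge_feed_filings(feeds):
    """Merge (filename, content) pairs from several feeds, dropping exact repeats.

    `feeds` is an iterable of iterables of (filename: str, content: bytes).
    This layout is hypothetical; the real feed objects may look different.
    """
    seen_digests = {}
    merged = []
    for entries in feeds:
        for filename, content in entries:
            digest = hashlib.sha256(content).hexdigest()
            if filename not in seen_digests:
                seen_digests[filename] = digest
                merged.append((filename, content))
            elif seen_digests[filename] != digest:
                # Same name, different bytes: not the expected feed overlap.
                raise UnexpectedDuplicateError(
                    f"Conflicting content for duplicate filename: {filename!r}"
                )
            # Otherwise it is the same filing seen in two feeds; skip it.
    return merged


# Made-up sample data: one filing appears in both the main and a monthly feed.
main_feed = [("Utility_A_form1_Q4_2.xbrl", b"<xbrl>A2</xbrl>")]
month_feed = [
    ("Utility_A_form1_Q4_1.xbrl", b"<xbrl>A1</xbrl>"),
    ("Utility_A_form1_Q4_2.xbrl", b"<xbrl>A2</xbrl>"),
]
merged = merge_feed_filings([main_feed, month_feed])
```

Comparing content hashes rather than filenames alone keeps the expected overlap between feeds quiet while still surfacing a genuine collision.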