Mismatched recordings and transcriptions

huangruizhe commented 2 years ago

Hi, thanks for your efforts in providing the datasets online. It's a lot of work. While working with this dataset, we found some issues with the mapping between recordings and text.

According to the description:

> In the processed text transcript, each line is a sentence of CEO, ordered by time. ... Each line in the text file corresponds to an audio file. And the total number of lines in a text file is the same as the total number of audio files for the same earnings conference call.

But in fact, we found this is not the case. Among the 575 earnings conference calls, about half have a different number of transcript lines than audio files. This makes it hard to recover the correct audio-text mapping. I want to raise the issue here so that people are aware of it, and I hope we can discuss how to solve it.

| ID | Company Directory | # Lines of Text | # Audio Files | Abs. Diff |
| -- | -- | -- | -- | -- |
| 1 | Illumina Inc_20170801 | 40 | 173 | 133 |
| 2 | Waste Management Inc._20170726 | 206 | 79 | 127 |
| 3 | Waste Management Inc._20170216 | 157 | 49 | 108 |
| 4 | Roper Technologies_20170209 | 183 | 285 | 102 |
| 5 | DTE Energy Co._20170726 | 89 | 0 | 89 |
| 6 | TE Connectivity Ltd._20170125 | 216 | 131 | 85 |
| 7 | Twitter, Inc._20170727 | 99 | 183 | 84 |
| 8 | Twitter, Inc._20170426 | 123 | 49 | 74 |
| 9 | CIGNA Corp._20170804 | 183 | 120 | 63 |
| 10 | The Clorox Company_20170503 | 64 | 12 | 52 |
| 11 | A.O. Smith Corp_20170726 | 62 | 114 | 52 |
| 12 | Foot Locker Inc_20170224 | 93 | 144 | 51 |
| 13 | Humana Inc._20170802 | 162 | 118 | 44 |
| 14 | Illinois Tool Works_20170424 | 84 | 124 | 40 |
| 15 | Biogen Inc._20170126 | 106 | 141 | 35 |
| 16 | Dover Corp._20170720 | 97 | 113 | 16 |
| 17 | eBay Inc._20170419 | 26 | 10 | 16 |
| 18 | BorgWarner_20170727 | 83 | 69 | 14 |
| 19 | Wec Energy Group Inc_20170201 | 184 | 172 | 12 |
| 20 | Fortune Brands Home & Security_20170802 | 234 | 243 | 9 |
| 21 | Edwards Lifesciences_20170201 | 289 | 297 | 8 |
| 22 | AMETEK Inc._20170207 | 353 | 359 | 6 |
| 23 | Hologic_20171108 | 264 | 269 | 5 |
| 24 | Hologic_20170802 | 251 | 255 | 4 |
| 25 | ResMed_20170427 | 252 | 256 | 4 |
| 26 | Snap-On Inc._20170202 | 300 | 296 | 4 |
| 27 | PACCAR Inc._20170725 | 144 | 147 | 3 |
| 28 | Aon plc_20170509 | 228 | 231 | 3 |
| 29 | AmerisourceBergen Corp_20171102 | 211 | 214 | 3 |
| 30 | Western Union Co_20170502 | 205 | 208 | 3 |
| 31 | Walmart_20171116 | 193 | 196 | 3 |
| 32 | Ford Motor_20170427 | 326 | 329 | 3 |
| 33 | Xcel Energy Inc_20170202 | 219 | 222 | 3 |
| 34 | Quanta Services Inc._20170221 | 194 | 197 | 3 |
| 35 | Kraft Heinz Co_20170503 | 91 | 94 | 3 |
| 36 | Grainger (W.W.) Inc._20170719 | 273 | 276 | 3 |
| 37 | Skyworks Solutions_20171106 | 236 | 239 | 3 |
| 38 | Hormel Foods Corp._20170223 | 257 | 260 | 3 |
| 39 | Celgene Corp._20170427 | 121 | 124 | 3 |
| 40 | Exxon Mobil Corp._20171027 | 448 | 451 | 3 |
| 41 | AT&T Inc._20170425 | 225 | 228 | 3 |
| 42 | Coca-Cola Company (The)_20170726 | 235 | 237 | 2 |
| 43 | Rockwell Automation Inc._20170726 | 199 | 197 | 2 |
| 44 | American Express Co_20171018 | 231 | 233 | 2 |
| 45 | Fortive Corp_20170207 | 275 | 277 | 2 |
| 46 | Red Hat Inc._20171219 | 84 | 86 | 2 |
| 47 | Hanesbrands Inc_20171101 | 52 | 50 | 2 |
| 48 | Symantec Corp._20170510 | 252 | 250 | 2 |
| 49 | Stericycle Inc_20170504 | 57 | 59 | 2 |
| 50 | Hasbro Inc._20171023 | 221 | 223 | 2 |
| 51 | FleetCor Technologies Inc_20170803 | 157 | 159 | 2 |
| 52 | Varian Medical Systems_20171025 | 129 | 131 | 2 |
| 53 | Home Depot_20170221 | 141 | 143 | 2 |
| 54 | Hasbro Inc._20170206 | 255 | 257 | 2 |
| 55 | Comerica Inc._20170418 | 117 | 119 | 2 |
| 56 | JPMorgan Chase & Co._20170714 | 230 | 232 | 2 |
| 57 | Ball Corp_20170803 | 220 | 222 | 2 |
| 58 | Caterpillar Inc._20170126 | 281 | 279 | 2 |
| 59 | Xerox_20170801 | 150 | 152 | 2 |
| 60 | Kellogg Co._20170504 | 173 | 175 | 2 |
| 61 | Noble Energy Inc_20170803 | 180 | 182 | 2 |
| 62 | Mastercard Inc._20170727 | 279 | 281 | 2 |
| 63 | F5 Networks_20170726 | 189 | 191 | 2 |
| 64 | Vulcan Materials_20170802 | 259 | 261 | 2 |
| 65 | Kellogg Co._20170803 | 204 | 202 | 2 |
| 66 | Amazon.com Inc._20170202 | 57 | 59 | 2 |
| 67 | Synopsys Inc._20170517 | 180 | 178 | 2 |
| 68 | Alaska Air Group Inc_20171025 | 88 | 90 | 2 |
| 69 | Campbell Soup_20171121 | 171 | 173 | 2 |
| 70 | Intuit Inc._20170523 | 295 | 293 | 2 |
| 71 | Aetna Inc_20171031 | 140 | 142 | 2 |
| 72 | Starbucks Corp._20170126 | 117 | 119 | 2 |
| 73 | Martin Marietta Materials_20170801 | 423 | 425 | 2 |
| 74 | Home Depot_20170516 | 100 | 102 | 2 |
| 75 | Skyworks Solutions_20170720 | 221 | 223 | 2 |
| 76 | Walgreens Boots Alliance_20171025 | 264 | 266 | 2 |
| 77 | PPG Industries_20170119 | 282 | 284 | 2 |
| 78 | Amgen Inc._20170725 | 176 | 178 | 2 |
| 79 | Schlumberger Ltd._20170421 | 225 | 227 | 2 |
| 80 | Salesforce.com_20170228 | 145 | 147 | 2 |
| 81 | United Parcel Service_20170727 | 170 | 172 | 2 |
| 82 | Salesforce.com_20170822 | 188 | 190 | 2 |
| 83 | AT&T Inc._20171024 | 399 | 397 | 2 |
| 84 | Garmin Ltd._20170222 | 144 | 146 | 2 |
| 85 | Kohl's Corp._20170223 | 197 | 199 | 2 |
| 86 | Kimberly-Clark_20170124 | 265 | 267 | 2 |
| 87 | Halliburton Co._20171023 | 270 | 272 | 2 |
| 88 | Merck & Co._20170202 | 87 | 86 | 1 |
| 89 | SVB Financial_20170126 | 170 | 169 | 1 |
| 90 | Darden Restaurants_20170926 | 220 | 219 | 1 |
| 91 | Anthem Inc._20170426 | 190 | 191 | 1 |
| 92 | FLIR Systems_20171025 | 67 | 66 | 1 |
| 93 | NetApp_20171115 | 168 | 169 | 1 |
| 94 | Goldman Sachs Group_20170118 | 11 | 12 | 1 |
| 95 | Chevron Corp._20170127 | 56 | 55 | 1 |
| 96 | Digital Realty Trust Inc_20170427 | 226 | 227 | 1 |
| 97 | Equinix_20170802 | 155 | 154 | 1 |
| 98 | American Tower Corp A_20170427 | 156 | 157 | 1 |
| 99 | The Hershey Company_20170203 | 95 | 96 | 1 |
| 100 | Mastercard Inc._20170131 | 260 | 261 | 1 |
| 101 | eBay Inc._20170125 | 183 | 184 | 1 |
| 102 | Tractor Supply Company_20170726 | 158 | 159 | 1 |
| 103 | Motorola Solutions Inc._20170803 | 102 | 103 | 1 |
| 104 | Motorola Solutions Inc._20171102 | 64 | 65 | 1 |
| 105 | Baxter International Inc._20170726 | 180 | 181 | 1 |
| 106 | Edison Int'l_20170501 | 158 | 157 | 1 |
| 107 | Nordstrom_20171109 | 125 | 124 | 1 |
| 108 | Royal Caribbean Cruises Ltd_20171107 | 213 | 214 | 1 |
| 109 | Wec Energy Group Inc_20170502 | 166 | 165 | 1 |
| 110 | Nordstrom_20170223 | 116 | 117 | 1 |
| 111 | Royal Caribbean Cruises Ltd_20170801 | 152 | 153 | 1 |
| 112 | Estee Lauder Cos._20170202 | 218 | 219 | 1 |
| 113 | Tractor Supply Company_20171025 | 178 | 179 | 1 |
| 114 | The Clorox Company_20171101 | 48 | 49 | 1 |
| 115 | Nasdaq, Inc._20170131 | 236 | 235 | 1 |
| 116 | CMS Energy_20170501 | 192 | 191 | 1 |
| 117 | AES Corp_20171102 | 209 | 208 | 1 |
| 118 | CVS Health_20170209 | 218 | 219 | 1 |
| 119 | Ulta Beauty_20171130 | 209 | 210 | 1 |
| 120 | Molson Coors Brewing Company_20170802 | 97 | 96 | 1 |
| 121 | SunTrust Banks_20171020 | 99 | 98 | 1 |
| 122 | Hilton Worldwide Holdings Inc_20170215 | 313 | 314 | 1 |
| 123 | Gartner Inc_20171102 | 244 | 243 | 1 |
| 124 | Gap Inc._20170223 | 327 | 326 | 1 |
| 125 | Cognizant Technology Solutions_20170803 | 105 | 106 | 1 |
| 126 | Dollar General_20171207 | 74 | 75 | 1 |
| 127 | Carmax Inc_20170922 | 273 | 272 | 1 |
| 128 | Automatic Data Processing_20170727 | 270 | 271 | 1 |
| 129 | ResMed_20171026 | 187 | 188 | 1 |
| 130 | PPL Corp._20170201 | 132 | 133 | 1 |
| 131 | NiSource Inc._20170503 | 44 | 45 | 1 |
| 132 | Tyson Foods_20170206 | 271 | 272 | 1 |
| 133 | Alaska Air Group Inc_20170208 | 115 | 116 | 1 |
| 134 | Hologic_20170201 | 196 | 197 | 1 |
| 135 | Estee Lauder Cos._20170818 | 250 | 251 | 1 |
| 136 | Martin Marietta Materials_20170502 | 426 | 427 | 1 |
| 137 | Hewlett Packard Enterprise_20170531 | 170 | 171 | 1 |
| 138 | Yum! Brands Inc_20170803 | 155 | 156 | 1 |
| 139 | Ross Stores_20171116 | 73 | 74 | 1 |
| 140 | Microsoft Corp._20170427 | 167 | 166 | 1 |
| 141 | Coty, Inc_20171109 | 205 | 204 | 1 |
| 142 | Wec Energy Group Inc_20171026 | 140 | 139 | 1 |
| 143 | Comcast Corp._20170126 | 128 | 129 | 1 |
| 144 | Motorola Solutions Inc._20170504 | 84 | 83 | 1 |
| 145 | Michael Kors Holdings_20170207 | 248 | 249 | 1 |
| 146 | Gartner Inc_20170202 | 215 | 216 | 1 |
| 147 | Becton Dickinson_20170803 | 244 | 243 | 1 |
| 148 | PayPal_20170126 | 150 | 151 | 1 |
| 149 | Texas Instruments_20170124 | 113 | 114 | 1 |
| 150 | Emerson Electric Company_20170207 | 521 | 520 | 1 |
| 151 | F5 Networks_20170426 | 166 | 167 | 1 |
| 152 | PepsiCo Inc._20170426 | 174 | 175 | 1 |
| 153 | Juniper Networks_20170425 | 213 | 214 | 1 |
| 154 | Sysco Corp._20171106 | 195 | 196 | 1 |
| 155 | Allergan, Plc_20170803 | 173 | 174 | 1 |
| 156 | Lockheed Martin Corp._20170718 | 148 | 147 | 1 |
| 157 | Cimarex Energy_20170510 | 71 | 72 | 1 |
| 158 | Lennar Corp._20171003 | 199 | 198 | 1 |
| 159 | Franklin Resources_20170428 | 15 | 16 | 1 |
| 160 | Polo Ralph Lauren Corp._20170202 | 180 | 181 | 1 |
| 161 | PPL Corp._20170803 | 88 | 89 | 1 |
| 162 | Ross Stores_20170228 | 85 | 86 | 1 |
| 163 | The Walt Disney Company_20170509 | 85 | 86 | 1 |
| 164 | Martin Marietta Materials_20170214 | 468 | 467 | 1 |
| 165 | Stryker Corp._20171026 | 101 | 100 | 1 |
| 166 | Nielsen Holdings_20171025 | 131 | 132 | 1 |
| 167 | Bristol-Myers Squibb_20170427 | 91 | 92 | 1 |
| 168 | Synopsys Inc._20171129 | 159 | 160 | 1 |
| 169 | Kohl's Corp._20170511 | 96 | 97 | 1 |
| 170 | Intel Corp._20170727 | 208 | 209 | 1 |
| 171 | Lowe's Cos._20171121 | 72 | 73 | 1 |
| 172 | Abbott Laboratories_20171018 | 244 | 245 | 1 |
| 173 | Parker-Hannifin_20171102 | 187 | 188 | 1 |
| 174 | Kansas City Southern_20170120 | 117 | 116 | 1 |
| 175 | Stericycle Inc_20171108 | 64 | 65 | 1 |
| 176 | Target Corp._20170816 | 119 | 120 | 1 |
| 177 | Masco Corp._20170727 | 159 | 158 | 1 |
| 178 | Lilly (Eli) & Co._20170425 | 102 | 101 | 1 |
| 179 | Celgene Corp._20170126 | 142 | 143 | 1 |
| 180 | Coca-Cola Company (The)_20170209 | 146 | 147 | 1 |
| 181 | Xerox_20170425 | 125 | 126 | 1 |
| 182 | Oracle Corp._20170914 | 63 | 64 | 1 |
| 183 | Digital Realty Trust Inc_20170216 | 172 | 173 | 1 |
| 184 | SCANA Corp_20170427 | 108 | 107 | 1 |
| 185 | The Mosaic Company_20170801 | 99 | 100 | 1 |
| 186 | The Walt Disney Company_20170207 | 97 | 98 | 1 |
| 187 | Incyte_20170504 | 46 | 45 | 1 |
| 188 | CA, Inc._20170802 | 74 | 75 | 1 |
| 189 | NRG Energy_20170228 | 178 | 179 | 1 |
| 190 | F5 Networks_20170125 | 155 | 156 | 1 |
| 191 | Foot Locker Inc_20170519 | 155 | 156 | 1 |
| 192 | Broadridge Financial Solutions_20170510 | 115 | 114 | 1 |
| 193 | AMETEK Inc._20170802 | 252 | 253 | 1 |
| 194 | Newell Brands_20170206 | 102 | 103 | 1 |
| 195 | CenturyLink Inc_20170802 | 184 | 185 | 1 |
| 196 | Gilead Sciences_20170502 | 113 | 112 | 1 |
| 197 | Norwegian Cruise Line_20170222 | 78 | 79 | 1 |
| 198 | CIGNA Corp._20170505 | 272 | 273 | 1 |
| 199 | Red Hat Inc._20170925 | 108 | 109 | 1 |
| 200 | Akamai Technologies Inc_20170502 | 198 | 199 | 1 |
| 201 | Hilton Worldwide Holdings Inc_20170502 | 294 | 295 | 1 |
| 202 | Ingersoll-Rand PLC_20170426 | 145 | 144 | 1 |
| 203 | Cisco Systems_20171115 | 163 | 164 | 1 |
| 204 | Regions Financial Corp._20170721 | 221 | 220 | 1 |
| 205 | Incyte_20170214 | 66 | 67 | 1 |
| 206 | TE Connectivity Ltd._20170726 | 227 | 226 | 1 |
| 207 | AmerisourceBergen Corp_20170504 | 249 | 248 | 1 |
| 208 | PPL Corp._20170504 | 99 | 98 | 1 |
| 209 | Church & Dwight_20171102 | 168 | 167 | 1 |
| 210 | The Bank of New York Mellon Corp._20170720 | 171 | 172 | 1 |
| 211 | AbbVie Inc._20170728 | 64 | 65 | 1 |
| 212 | Altria Group Inc_20171026 | 165 | 166 | 1 |
| 213 | General Growth Properties Inc._20170501 | 247 | 248 | 1 |
| 214 | Mondelez International_20170802 | 184 | 185 | 1 |
| 215 | Thermo Fisher Scientific_20170426 | 235 | 234 | 1 |
| 216 | Church & Dwight_20170803 | 135 | 136 | 1 |
| 217 | Waste Management Inc._20171026 | 158 | 159 | 1 |
| 218 | Campbell Soup_20170519 | 208 | 207 | 1 |
| 219 | Garmin Ltd._20171101 | 148 | 149 | 1 |
| 220 | IPG Photonics Corp._20170502 | 199 | 200 | 1 |
| 221 | Iron Mountain Incorporated_20170223 | 140 | 141 | 1 |
| 222 | Raytheon Co._20171026 | 139 | 140 | 1 |
| 223 | Skyworks Solutions_20170427 | 210 | 209 | 1 |
| 224 | Microsoft Corp._20170126 | 190 | 191 | 1 |
| 225 | Aon plc_20171027 | 110 | 111 | 1 |
| 226 | The Mosaic Company_20170502 | 82 | 83 | 1 |
| 227 | Gap Inc._20170518 | 209 | 210 | 1 |
| 228 | PACCAR Inc._20171024 | 226 | 227 | 1 |
| 229 | Electronic Arts_20171031 | 247 | 248 | 1 |
| 230 | CBS Corp._20171102 | 243 | 244 | 1 |
| 231 | United Health Group Inc._20170418 | 90 | 91 | 1 |
| 232 | Verizon Communications_20171019 | 253 | 252 | 1 |
| 233 | Tractor Supply Company_20170426 | 89 | 90 | 1 |
| 234 | Becton Dickinson_20171102 | 202 | 201 | 1 |
| 235 | Autodesk Inc._20170302 | 85 | 86 | 1 |
| 236 | Celgene Corp._20170727 | 161 | 162 | 1 |
| 237 | KLA-Tencor Corp._20170427 | 132 | 133 | 1 |
| 238 | Goldman Sachs Group_20170718 | 241 | 242 | 1 |
| 239 | Coty, Inc_20170209 | 211 | 210 | 1 |
| 240 | Intercontinental Exchange_20170207 | 127 | 128 | 1 |
| 241 | American Tower Corp A_20171031 | 133 | 134 | 1 |
| 242 | Verizon Communications_20170727 | 257 | 258 | 1 |
| 243 | MGM Resorts International_20170727 | 222 | 223 | 1 |
| 244 | Baxter International Inc._20170426 | 221 | 222 | 1 |
| 245 | Comcast Corp._20170727 | 86 | 87 | 1 |
| 246 | Align Technology_20170427 | 206 | 205 | 1 |
| 247 | Boeing Company_20170426 | 307 | 306 | 1 |
| 248 | International Paper_20170202 | 137 | 136 | 1 |
| 249 | American Tower Corp A_20171031 | 133 | 134 | 1 |
| 250 | Verizon Communications_20170727 | 257 | 258 | 1 |
| 251 | MGM Resorts International_20170727 | 222 | 223 | 1 |
| 252 | Baxter International Inc | 221 | 222 | 1 |
| 253 | Comcast Corp | 86 | 87 | 1 |
| 254 | Align Technology_20170427 | 206 | 205 | 1 |
| 255 | Boeing Company_20170426 | 307 | 306 | 1 |
| 256 | International Paper_20170202 | 137 | 136 | 1 |
| 257 | The Clorox Company_20170203 | 94 | 95 | 1 |
| 258 | Estee Lauder Cos | 226 | 225 | 1 |
| 259 | Intel Corp | 108 | 107 | 1 |
| 260 | Martin Marietta Materials_20171102 | 425 | 426 | 1 |
| 261 | CBS Corp | 190 | 191 | 1 |
| 262 | Celgene Corp | 84 | 83 | 1 |
| 263 | Apache Corporation_20170223 | 260 | 261 | 1 |
| 264 | ABIOMED Inc_20170504 | 216 | 217 | 1 |
| 265 | Rockwell Automation Inc | 170 | 171 | 1 |
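For anyone who wants to reproduce counts like the ones in the table above, here is a minimal sketch of how the comparison could be done. It is not the script we actually used, and the layout is an assumption: one directory per call, a transcript file named `Text.txt` in it, and `.mp3` audio files somewhere below it. Both names are hypothetical placeholders, so pass the real ones as arguments.

```python
from pathlib import Path


def count_mismatches(root, text_name="Text.txt", audio_glob="*.mp3"):
    """Return (call_dir, n_lines, n_audio, abs_diff) for each mismatched call.

    text_name and audio_glob are assumed placeholders, not the dataset's
    documented layout; adjust them to the actual file names.
    """
    rows = []
    for call_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        text_file = call_dir / text_name
        if text_file.is_file():
            # Count only non-blank lines, so a trailing empty line is ignored.
            with text_file.open(encoding="utf-8", errors="replace") as fh:
                n_lines = sum(1 for line in fh if line.strip())
        else:
            # A missing transcript is reported as zero lines.
            n_lines = 0
        # rglob also finds audio files stored in a subfolder of the call.
        n_audio = sum(1 for _ in call_dir.rglob(audio_glob))
        if n_lines != n_audio:
            rows.append((call_dir.name, n_lines, n_audio, abs(n_lines - n_audio)))
    # Largest discrepancy first; ties keep directory-name order.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

Calling `count_mismatches("path/to/dataset")` returns one tuple per mismatched call. Whether blank lines should count as sentences is a judgment call; this sketch skips them.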

firmai commented 2 years ago

@huangruizhe are you also planning on grabbing newer data with this replication?

GeminiLn commented 2 years ago

Hi, thank you for pointing out the issue. It has been a while since we finished this work, so I need some time to check the data. I will let you know if there is anything wrong with the released data.

huangruizhe commented 2 years ago

@firmai Our main interests are in automatic speech recognition. For the moment, we will probably work with existing datasets. @GeminiLn Thanks in advance!!

GeminiLn commented 2 years ago

Hi, I checked the data and found that the data on our side is correct, but the text sequences in the released data are mismatched for some reason. I'll update the text data within one or two days. The audio sequences are correct.

Thank you again for pointing out the problem! @huangruizhe

GeminiLn commented 2 years ago

@huangruizhe Sorry for the delay. I experienced some connection problems last week. The new dataset is available now. Please find the link in the README file. The audio and text should be all matched now.

huangruizhe commented 2 years ago

Thanks @GeminiLn ! I will definitely check it out. Thanks again for the update and the work!

huangruizhe commented 2 years ago

Hi @GeminiLn, I have checked out the new dataset, but something may have gone wrong. After downloading all zip files and unzipping them (which takes a long time), I get only 53 calls instead of 575 as before.

Could you suggest what may have gone wrong?

huangruizhe commented 2 years ago

There were a lot of errors during unzipping, e.g.:

    ...
    file #82663: bad zipfile offset (lseek): 1046069248
    file #82664: bad zipfile offset (lseek): 1046151168
    file #82665: bad zipfile offset (lseek): 1046323200
    file #82666: bad zipfile offset (lseek): 1046364160
    file #82667: bad zipfile offset (lseek): 1046429696
    file #82668: bad zipfile offset (lseek): 1046462464
    file #82669: bad zipfile offset (lseek): 1046544384
    file #82670: bad zipfile offset (lseek): 1046683648
    file #82671: bad zipfile offset (lseek): 1046740992
    file #82672: bad zipfile offset (lseek): 1046773760
    file #82673: bad zipfile offset (lseek): 1046798336
    file #82674: bad zipfile offset (lseek): 1046953984
    file #82675: bad zipfile offset (lseek): 1047076864
    file #82676: bad zipfile offset (lseek): 1047150592
    file #82677: bad zipfile offset (lseek): 1047265280
    file #82678: bad zipfile offset (lseek): 1047347200
    file #82679: bad zipfile offset (lseek): 1047412736
    file #82680: bad zipfile offset (lseek): 1047494656

Just FYI, I ran this command in the directory shown in the screenshot:

    unzip -qq ACL19_Release.zip

huangruizhe commented 2 years ago

I guess it might be the unzip issue: https://support.firmex.com/hc/en-us/articles/204579673-Downloads-with-multiple-parts-z01-and-z02-files-#1-download-all-the-parts-to-the-same-folder-on-your-computer-0-1

You might have worked on Windows and used WinZip or WinRAR to compress the files. I may need to use the same software to uncompress them, instead of the "unzip" command on Linux. I will look into this.

huangruizhe commented 2 years ago

I happen to have a Windows machine, so I was able to unzip the dataset successfully with WinZip. It would be great if the dataset could be released in another format that is compatible with Linux (just a small suggestion).

GeminiLn commented 2 years ago

Hi @huangruizhe, thank you for the feedback. Actually, the file processing and compression were done on a Linux machine. I'll look into whether anything went wrong during compression. Are you able to access the full dataset with WinZip?

GeminiLn commented 2 years ago

Hi @huangruizhe. I tested it on my Linux machine. You might need to merge the split files first:

    zip -s0 ACL19_Release.zip --out ACL19_Release_All.zip

Then unzip the merged archive:

    unzip -q ACL19_Release_All.zip

Sorry for the inconvenience. I will add instructions to the README file. The new dataset is split into parts because some researchers in mainland China have trouble downloading large files from Google Drive.
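In case it is useful for scripting, the merge-then-extract steps above could be wrapped roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, the split parts are assumed to be named `.z01`, `.z02`, and so on next to the `.zip` file, and the `zip` and `unzip` tools must be installed. The only flags used are the ones quoted above plus `unzip -d` for the output directory.

```python
import re
import subprocess
from pathlib import Path


def find_split_parts(zip_path):
    """List the .z01, .z02, ... parts that sit next to a split archive."""
    zip_path = Path(zip_path)
    pattern = re.compile(re.escape(zip_path.stem) + r"\.z\d+$")
    return sorted(p for p in zip_path.parent.iterdir() if pattern.match(p.name))


def build_commands(zip_path, merged_path, out_dir):
    """Return the merge and extract commands as argument lists."""
    merge = ["zip", "-s0", str(zip_path), "--out", str(merged_path)]
    extract = ["unzip", "-q", str(merged_path), "-d", str(out_dir)]
    return merge, extract


def merge_and_extract(zip_path, merged_path, out_dir):
    """Merge a split archive into a single file, then extract it."""
    if not find_split_parts(zip_path):
        # Fail early if the parts were not downloaded into the same folder.
        raise FileNotFoundError(f"no .zNN parts found next to {zip_path}")
    for cmd in build_commands(zip_path, merged_path, out_dir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Checking for the parts first is a design choice: it gives a clearer error than a long run of "bad zipfile offset" messages when a part is missing.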

huangruizhe commented 2 years ago

When I switched to WinZip, it went okay, so I assumed that this was how the data was prepared. I finally got 572 earnings calls and 89722 audio files in total. Is that correct? Thanks for the commands for merging and unzipping the files. They will be useful for others who are interested in the dataset.

GeminiLn commented 2 years ago

My guess is that WinZip merges the files automatically, but the unzip command on Linux does not. And yes, you have the correct dataset. I hope it will be helpful for your research.

huangruizhe commented 2 years ago

Thank you!

GeminiLn / EarningsCall_Dataset

Mismatched recordings and transcriptions #4