Closed Grantskie closed 3 years ago
General Tech Job Spikes
Count the number of URLs in the CSV to get overall tech jobs and graph by day. Find daily/weekly/monthly changes and report only those that exceed certain thresholds, possibly a flat value or a percentage change. This approach may be blind to spikes that build up over multiple days/weeks/months.
Specific Tech Job Positions
Word count by URL and select the top three appearances for each week of the quarter?
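One possible way around the multi-day blind spot is to compare rolling windows instead of single days. The sketch below is only an illustration in plain Python: the function name `find_spikes`, the 7-day window, the 30% threshold, and the sample counts are all assumptions, and it expects one count per consecutive day.

```python
from datetime import date, timedelta

def find_spikes(daily_counts, window=7, pct_threshold=0.5):
    """Return (end_day, previous_total, current_total) for each window whose
    total exceeds the preceding window's total by more than pct_threshold."""
    days = sorted(daily_counts)
    spikes = []
    for i in range(2 * window, len(days) + 1):
        previous = sum(daily_counts[d] for d in days[i - 2 * window:i - window])
        current = sum(daily_counts[d] for d in days[i - window:i])
        if previous > 0 and (current - previous) / previous > pct_threshold:
            spikes.append((days[i - 1], previous, current))
    return spikes

# Hypothetical data: a flat week, then a slow ramp of roughly 8% per day.
# No single day-over-day change reaches 10%, so a daily threshold misses it.
start = date(2021, 2, 1)
counts = {start + timedelta(days=n): 100 for n in range(7)}
level = 100
for n in range(7, 14):
    level = round(level * 1.08)
    counts[start + timedelta(days=n)] = level

spikes = find_spikes(counts, window=7, pct_threshold=0.3)
for end_day, previous, current in spikes:
    print(end_day, previous, current)
```

The same idea would carry over to weekly or monthly buckets by changing what each key in `daily_counts` represents.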
Currently, this query is focused on the CC-MAIN-2021-10 crawl (which covers February and March 2021) in order to produce an MVP as soon as possible. The stretch goal of processing all crawls in 2021 will be considered once this is good to go.
The following is the query for the programming-language counts over a .CSV file. The earlier version selected one scalar subquery per language FROM dat, which repeated the same result once per row of dat and needed LIMIT 1. Conditional aggregation returns a single row without the limit, and lower(url) makes the match case-insensitive (the earlier '%SQL%' pattern only matched uppercase).
spark.sql("SELECT count(CASE WHEN lower(url) LIKE '%java%' THEN 1 END) AS java, count(CASE WHEN lower(url) LIKE '%python%' THEN 1 END) AS python, count(CASE WHEN lower(url) LIKE '%scala%' THEN 1 END) AS scala, count(CASE WHEN lower(url) LIKE '%matlab%' THEN 1 END) AS matlab, count(CASE WHEN lower(url) LIKE '%sql%' THEN 1 END) AS sql FROM dat").show()
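One caveat with substring patterns such as '%java%' and '%scala%' is that they also match URLs containing "javascript" or "scalable". Whether that matters depends on the data. A rough sketch of whole-token matching in plain Python follows; the helper name `count_languages` and the example URLs are made up for illustration and are not part of the project.

```python
import re

LANGUAGES = ["java", "python", "scala", "matlab", "sql"]

def count_languages(urls, languages=LANGUAGES):
    """Count URLs that contain each language as a whole token.

    A URL is lowercased and split on every character that is not a letter
    or digit, so "javascript" does not count as "java".
    """
    counts = {lang: 0 for lang in languages}
    for url in urls:
        tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
        for lang in languages:
            if lang in tokens:
                counts[lang] += 1
    return counts

# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://example.com/jobs/senior-java-developer",
    "https://example.com/jobs/javascript-engineer",
    "https://example.com/jobs/Python-SQL-analyst",
    "https://example.com/jobs/scalable-systems-lead",
]
language_counts = count_languages(sample_urls)
print(language_counts)
```

Each URL is counted at most once per language, which mirrors what count(url) with a LIKE filter does.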
Internal state when error was thrown: recordCount=11277, recordData=[com,cms24-7,jobs)/job/registered-nurse-rn-medical-surgical-ms-registered-nurse-rn-elizabeth-nj-nj-15364401/381149d2-1866-11eb-bf99-42010a8a003a?listingurl=http://jobs.cms24-7.com/job/registered-nurse-rn-intensive-care-unit-icu-registered-nurse-rn-pennington-nj-nj-13771025/183b192d-eb87-11ea-b61e-42010a8a0ff4?listingurl=http://jobs.cms24-7.com/3naf1s/registered-nurse-rn-cardiovascular-intesive-care-unit-icucvicu-registered-nurse-rn-pomona-nj-13935528?id=8d4ff9ff-ee29-11ea-b015-42010a8a0ff4, https://jobs.cms24-7.com/job/registered-nurse-rn-medical-surgical-ms-registered-nurse-rn-elizabeth-nj-nj-15364401/381149d2-1866-11eb-bf99-42010a8a003a?listingUrl=%2568%2574%2574%2570%253A%252F%252F%256A%256F%2562%2573%252E%2563%256D%2573%2532%2534%252D%2537%252E%2563%256F%256D%252F%256A%256F%2562%252F%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2569%256E%2574%2565%256E%2573%2569%2576%2565%252D%2563%2561%2572%2565%252D%2575%256E%2569%2574%252D%2569%2563%2575%252D%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2570%2565%256E%256E%2569%256E%2567%2574%256F%256E%252D%256E%256A%252D%256E%256A%252D%2531%2533%2537%2537%2531%2530%2532%2535%252F%2531%2538%2533%2562%2531%2539%2532%2564%252D%2565%2562%2538%2537%252D%2531%2531%2565%2561%252D%2562%2536%2531%2565%252D%2534%2532%2530%2531%2530%2561%2538%2561%2530%2566%2566%2534%253F%256C%2569%2573%2574%2569%256E%2567%2555%2572%256C%253D%2525%2532%2535%2536%2538%2525%2532%2535%2537%2534%2525%2532%2535%2537%2534%2525%2532%2535%2537%2530%2525%2532%2535%2533%2541%2525%2532%2535%2532%2546%2525%2532%2535%2532%2546%2525%2532%2535%2536%2541%2525%2532%2535%2536%2546%2525%2532%2535%2536%2532%2525%2532%2535%2537%2533%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2544%2525%2532%2535%2537%2533%2525%2532%2535%2533%2532%2525%2532%2535%2533%2534%2525%2532%2535%253
2%2544%2525%2532%2535%2533%2537%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2532%2546%2525%2532%2535%2533%2533%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2536%2536%2525%2532%2535%2533%2531%2525%2532%2535%2537%2533%2525%2532%2535%2532%2546%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%2536%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2534%2525%2532%2535%2536%2539%2525%2532%2535%2536%2546%2525%2532%2535%2537%2536%2525%2532%2535%2536%2531%2525%2532%2535%2537%2533%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2543%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2545%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2533%2525%2532%2535%2536%2539%2525%2532%2535%2537%2536%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2535%2525%2532%2535%2536%2545%2525%2532%2535%2536%2539%2525%2532%2535%2537%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2533%2525%2532%2535%2537%2536%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%253
6%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2537%2530%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2536%2546%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2536%2541%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2533%2525%2532%2535%2533%2539%2525%2532%2535%2533%2533%2525%2532%2535%2533%2535%2525%2532%2535%2533%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2538%2525%2532%2535%2533%2546%2525%2532%2535%2536%2539%2525%2532%2535%2536%2534%2525%2532%2535%2533%2544%2525%2532%2535%2533%2538%2525%2532%2535%2536%2534%2525%2532%2535%2533%2534%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2539%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2532%2544%2525%2532%2535%2536%2535%2525%2532%2535%2536%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2539%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2531%2525%2532%2535%2536%2535%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%2532%2535%2536%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2535%2525%2532%2535%2532%2544%2525%2532%2535%2533%2534%2525%2532%2535%2533%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2531%2525%2532%2535%2533%2538%2525%2532%2535%2536%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2534, jobs.cms24-7.com, com, cms24-7, jobs, , , com, cms24-7.com, com, cms24-7.com, https, , 
/job/registered-nurse-rn-medical-surgical-ms-registered-nurse-rn-elizabeth-nj-nj-15364401/381149d2-1866-11eb-bf99-42010a8a003a, listingUrl=%2568%2574%2574%2570%253A%252F%252F%256A%256F%2562%2573%252E%2563%256D%2573%2532%2534%252D%2537%252E%2563%256F%256D%252F%256A%256F%2562%252F%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2569%256E%2574%2565%256E%2573%2569%2576%2565%252D%2563%2561%2572%2565%252D%2575%256E%2569%2574%252D%2569%2563%2575%252D%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2570%2565%256E%256E%2569%256E%2567%2574%256F%256E%252D%256E%256A%252D%256E%256A%252D%2531%2533%2537%2537%2531%2530%2532%2535%252F%2531%2538%2533%2562%2531%2539%2532%2564%252D%2565%2562%2538%2537%252D%2531%2531%2565%2561%252D%2562%2536%2531%2565%252D%2534%2532%2530%2531%2530%2561%2538%2561%2530%2566%2566%2534%253F%256C%2569%2573%2574%2569%256E%2567%2555%2572%256C%253D%2525%2532%2535%2536%2538%2525%2532%2535%2537%2534%2525%2532%2535%2537%2534%2525%2532%2535%2537%2530%2525%2532%2535%2533%2541%2525%2532%2535%2532%2546%2525%2532%2535%2532%2546%2525%2532%2535%2536%2541%2525%2532%2535%2536%2546%2525%2532%2535%2536%2532%2525%2532%2535%2537%2533%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2544%2525%2532%2535%2537%2533%2525%2532%2535%2533%2532%2525%2532%2535%2533%2534%2525%2532%2535%2532%2544%2525%2532%2535%2533%2537%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2532%2546%2525%2532%2535%2533%2533%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2536%2536%2525%2532%2535%2533%2531%2525%2532%2535%2537%2533%2525%2532%2535%2532%2546%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%2536%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%
2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2534%2525%2532%2535%2536%2539%2525%2532%2535%2536%2546%2525%2532%2535%2537%2536%2525%2532%2535%2536%2531%2525%2532%2535%2537%2533%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2543%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2545%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2533%2525%2532%2535%2536%2539%2525%2532%2535%2537%2536%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2535%2525%2532%2535%2536%2545%2525%2532%2535%2536%2539%2525%2532%2535%2537%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2533%2525%2532%2535%2537%2536%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%2536%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2537%2530%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2536%2546%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%
2532%2535%2536%2545%2525%2532%2535%2536%2541%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2533%2525%2532%2535%2533%2539%2525%2532%2535%2533%2533%2525%2532%2535%2533%2535%2525%2532%2535%2533%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2538%2525%2532%2535%2533%2546%2525%2532%2535%2536%2539%2525%2532%2535%2536%2534%2525%2532%2535%2533%2544%2525%2532%2535%2533%2538%2525%2532%2535%2536%2534%2525%2532%2535%2533%2534%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2539%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2532%2544%2525%2532%2535%2536%2535%2525%2532%2535%2536%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2539%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2531%2525%2532%2535%2536%2535%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%2532%2535%2536%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2535%2525%2532%2535%2532%2544%2525%2532%2535%2533%2534%2525%2532%2535%2533%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2531%2525%2532%2535%2533%2538%2525%2532%2535%2536%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2534, 2021-01-15T13:51:04.000-06:00, 200, , OJHZNR6QGYFDJKBWIJVZPEGKTHZXAKLG, text/html, text/html, UTF-8, eng, , crawl-data/CC-MAIN-2021-04/segments/1610703496947.2/warc/CC-MAIN-20210115194851-20210115224851-00331.warc.gz, 414510034, 46855, 1610703496947.2, CC-MAIN-2021-04, warc]
at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:916)
at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:706)
at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:82)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:139)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.StringIndexOutOfBoundsException: offset 0, count 5292, length 4096
at java.base/java.lang.String.checkBoundsOffCount(String.java:3304)
at java.base/java.lang.String.getChars(String.java:855)
at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:240)
at com.univocity.parsers.common.input.ExpandingCharAppender.append(ExpandingCharAppender.java:193)
at com.univocity.parsers.csv.CsvWriter.append(CsvWriter.java:296)
at com.univocity.parsers.csv.CsvWriter.processRow(CsvWriter.java:191)
at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:316)
This error occurred when trying to read all CSVs for a given year from the filtered index, separate the rows by month, and write them to multiple folders, one named for each month.
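The exception message (count 5292, length 4096) suggests that a 5292-character value was being copied into a 4096-character buffer, which would fit the very long listingUrl query string in the failing record. As a sketch only, and not what was actually done here, the month split could set oversized rows aside before writing. The `fetch_time` column name, the 4096 limit, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: taken from "length 4096" in the exception message.
MAX_FIELD_CHARS = 4096

def partition_by_month(rows, max_field_chars=MAX_FIELD_CHARS):
    """Group rows by the YYYY-MM of their fetch_time.

    Rows with any field longer than max_field_chars are returned separately
    instead of being passed on to the writer.
    """
    by_month = defaultdict(list)
    oversized = []
    for row in rows:
        if any(len(str(value)) > max_field_chars for value in row.values()):
            oversized.append(row)
            continue
        fetched = datetime.fromisoformat(row["fetch_time"])
        by_month[fetched.strftime("%Y-%m")].append(row)
    return dict(by_month), oversized

# Hypothetical rows; the third has a query string of about 6000 characters.
rows = [
    {"url": "https://example.com/jobs/1", "fetch_time": "2021-01-15T13:51:04.000-06:00"},
    {"url": "https://example.com/jobs/2", "fetch_time": "2021-02-03T08:00:00.000-06:00"},
    {"url": "https://example.com/job?listingUrl=" + "%25" * 2000,
     "fetch_time": "2021-01-20T10:00:00.000-06:00"},
]
by_month, oversized = partition_by_month(rows)
print(sorted(by_month), len(oversized))
```

Each key of `by_month` would correspond to one output folder, and `oversized` could be logged or truncated depending on whether those rows matter for the counts.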
Fixed when running through EMR.
2020/2021 job postings partitioned by month
After testing on the CC-MAIN-2021-10 crawl to generate queries that answer the following question, but by a monthly count:
Is there a significant spike in tech job postings at the end of business quarters? If so, which quarter spikes the most?
it was determined that including all crawls from the beginning of 2020 to the present was feasible.
So, any postings from this point forward refer to the following crawls:
CC-MAIN-2020-05
CC-MAIN-2020-10
CC-MAIN-2020-16
CC-MAIN-2020-24
CC-MAIN-2020-29
CC-MAIN-2020-34
CC-MAIN-2020-40
CC-MAIN-2020-45
CC-MAIN-2020-50
CC-MAIN-2021-04
CC-MAIN-2021-10
CC-MAIN-2021-17
CC-MAIN-2021-21
CC-MAIN-2021-25
CC-MAIN-2021-31
data visualization created for queries, separated by quarter then by month
Below are two graphs created using Tableau from the URL data pulled from the crawls. They are divided into quarters and broken out by programming language.
Languages vs time in months.
Is there a significant spike in tech job postings at the end of business quarters? If so, which quarter spikes the most?
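One way to put a number on the quarter-end question is to look at what share of each quarter's postings falls in its last month. The following is a small sketch in plain Python rather than the Tableau workflow used for the graphs. The function name `quarter_end_share` and the monthly counts are made up for illustration.

```python
from collections import defaultdict

def quarter_end_share(monthly_counts):
    """Map (year, quarter) -> share of that quarter's postings in its last month.

    monthly_counts maps (year, month) -> number of postings. A share well
    above 1/3 means the quarter-end month is above the quarter's average.
    """
    totals = defaultdict(int)
    last_month = {}
    for (year, month), count in monthly_counts.items():
        quarter = (month - 1) // 3 + 1
        totals[(year, quarter)] += count
        if month % 3 == 0:
            last_month[(year, quarter)] = count
    return {q: last_month[q] / totals[q] for q in last_month if totals[q] > 0}

# Hypothetical monthly counts, for illustration only.
monthly = {
    (2020, 1): 100, (2020, 2): 100, (2020, 3): 200,
    (2020, 4): 100, (2020, 5): 100, (2020, 6): 120,
}
shares = quarter_end_share(monthly)
top_quarter = max(shares, key=shares.get)
print(shares, top_quarter)
```

Quarters with no data for their final month are left out of the result, so partial quarters at the edges of the crawl range do not skew the comparison.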