1086-Maria-Big-Data / JobAdAnalytics


Query: Job Posting Spikes #16

Closed Grantskie closed 3 years ago

Grantskie commented 3 years ago

Is there a significant spike in tech job postings at the end of business quarters? If so, which quarter spikes the most?

mtorres1127 commented 3 years ago

General Tech Job Spikes: Count the number of URLs in the CSV to get overall tech jobs and graph by day. Finding daily/weekly/monthly changes and reporting only those that exceed a certain threshold (a flat value or a percentage change) may be blind to spikes that build up over multiple days/weeks/months.

Specific Tech Job Positions: Word count by URL and select the top three appearances for each week of the quarter?
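The blind spot described above can be illustrated with a minimal sketch. The function name `spikes`, the sample counts, and the 20% threshold are all made up for illustration and are not taken from the project. The idea is to compare each period not only with the one right before it, but also with the period `window` steps earlier, so that a gradual climb is still reported.

```python
from typing import List


def spikes(counts: List[int], threshold_pct: float, window: int = 1) -> List[int]:
    """Return indices whose count rose by more than threshold_pct
    compared with the count `window` periods earlier."""
    flagged = []
    for i in range(window, len(counts)):
        base = counts[i - window]
        if base <= 0:
            continue  # no meaningful percentage change from a zero baseline
        change_pct = (counts[i] - base) / base * 100.0
        if change_pct > threshold_pct:
            flagged.append(i)
    return flagged


# Hypothetical daily posting counts: a steady climb with no single big jump.
daily = [100, 115, 130, 150, 170, 172]

day_over_day = spikes(daily, threshold_pct=20.0)            # one-day comparison
three_day = spikes(daily, threshold_pct=20.0, window=3)     # three-day comparison
print(day_over_day, three_day)
```

With these sample numbers no single day crosses the threshold, while the three-day comparison flags the later days, which is the multi-day case a one-step threshold would miss.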

vinceecws commented 3 years ago

For now, this query will focus on the CC-MAIN-2021-10 crawl (which covers February and March 2021) in order to produce an MVP as soon as possible. The stretch goal of processing all crawls in 2021 will be considered once this is good to go.

mtorres1127 commented 3 years ago

The following is the query that counts programming languages in a .CSV file. I have to put a limit on it because it otherwise repeats rows.

spark.sql("SELECT (select count(url) from dat where url like '%java%') as java, (SELECT count(url) from dat where url like '%python%') as python, (SELECT count(url) from dat where url like '%scala%') as scala, (SELECT count(url) from dat where url like '%matlab%') as matlab, (SELECT count(url) from dat where url like '%SQL%') as sql from dat limit 1").show()
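The rows repeat because the outer query selects from `dat`, so the same scalar subquery results are emitted once per row of `dat`. One possible alternative, sketched here under assumptions, is to turn each count into a conditional aggregate over a single scan of `dat`, which yields exactly one row without a LIMIT. The sketch runs against an in-memory SQLite table so it can be tried without a cluster. The SELECT only uses `count`, `CASE`, `lower`, and `LIKE`, which Spark SQL also provides, but it has not been run against the project's data. The sample URLs and the helper name `language_counts` are invented.

```python
import sqlite3

# One pass over dat; each CASE yields NULL for non-matching rows,
# and count() ignores NULLs, so each column is a per-language count.
QUERY = """
SELECT
    count(CASE WHEN lower(url) LIKE '%java%'   THEN 1 END) AS java,
    count(CASE WHEN lower(url) LIKE '%python%' THEN 1 END) AS python,
    count(CASE WHEN lower(url) LIKE '%scala%'  THEN 1 END) AS scala,
    count(CASE WHEN lower(url) LIKE '%matlab%' THEN 1 END) AS matlab,
    count(CASE WHEN lower(url) LIKE '%sql%'    THEN 1 END) AS sql
FROM dat
"""

LANGUAGES = ("java", "python", "scala", "matlab", "sql")


def language_counts(urls):
    """Load urls into a throwaway table named dat and run QUERY on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE dat (url TEXT)")
        conn.executemany("INSERT INTO dat VALUES (?)", [(u,) for u in urls])
        row = conn.execute(QUERY).fetchone()
    finally:
        conn.close()
    return dict(zip(LANGUAGES, row))


# Hypothetical URLs, for illustration only.
sample = [
    "https://example.com/jobs/senior-java-developer",
    "https://example.com/jobs/python-data-engineer",
    "https://example.com/jobs/SQL-analyst",
    "https://example.com/jobs/registered-nurse",
]
counts = language_counts(sample)
print(counts)
```

Lower-casing the URL makes all five patterns case-insensitive, whereas the original query matches `SQL` only in upper case. A substring match on `java` will still count URLs containing `javascript`, so a stricter pattern may be needed depending on the goal.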

AngryManlet commented 3 years ago
Internal state when error was thrown: recordCount=11277, recordData=[com,cms24-7,jobs)/job/registered-nurse-rn-medical-surgical-ms-registered-nurse-rn-elizabeth-nj-nj-15364401/381149d2-1866-11eb-bf99-42010a8a003a?listingurl=http://jobs.cms24-7.com/job/registered-nurse-rn-intensive-care-unit-icu-registered-nurse-rn-pennington-nj-nj-13771025/183b192d-eb87-11ea-b61e-42010a8a0ff4?listingurl=http://jobs.cms24-7.com/3naf1s/registered-nurse-rn-cardiovascular-intesive-care-unit-icucvicu-registered-nurse-rn-pomona-nj-13935528?id=8d4ff9ff-ee29-11ea-b015-42010a8a0ff4, https://jobs.cms24-7.com/job/registered-nurse-rn-medical-surgical-ms-registered-nurse-rn-elizabeth-nj-nj-15364401/381149d2-1866-11eb-bf99-42010a8a003a?listingUrl=%2568%2574%2574%2570%253A%252F%252F%256A%256F%2562%2573%252E%2563%256D%2573%2532%2534%252D%2537%252E%2563%256F%256D%252F%256A%256F%2562%252F%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2569%256E%2574%2565%256E%2573%2569%2576%2565%252D%2563%2561%2572%2565%252D%2575%256E%2569%2574%252D%2569%2563%2575%252D%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2570%2565%256E%256E%2569%256E%2567%2574%256F%256E%252D%256E%256A%252D%256E%256A%252D%2531%2533%2537%2537%2531%2530%2532%2535%252F%2531%2538%2533%2562%2531%2539%2532%2564%252D%2565%2562%2538%2537%252D%2531%2531%2565%2561%252D%2562%2536%2531%2565%252D%2534%2532%2530%2531%2530%2561%2538%2561%2530%2566%2566%2534%253F%256C%2569%2573%2574%2569%256E%2567%2555%2572%256C%253D%2525%2532%2535%2536%2538%2525%2532%2535%2537%2534%2525%2532%2535%2537%2534%2525%2532%2535%2537%2530%2525%2532%2535%2533%2541%2525%2532%2535%2532%2546%2525%2532%2535%2532%2546%2525%2532%2535%2536%2541%2525%2532%2535%2536%2546%2525%2532%2535%2536%2532%2525%2532%2535%2537%2533%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2544%2525%2532%2535%2537%2533%2525%2532%2535%2533%2532%2525%2532%2535%2533%2534%2525%2532%2535%253
2%2544%2525%2532%2535%2533%2537%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2532%2546%2525%2532%2535%2533%2533%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2536%2536%2525%2532%2535%2533%2531%2525%2532%2535%2537%2533%2525%2532%2535%2532%2546%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%2536%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2534%2525%2532%2535%2536%2539%2525%2532%2535%2536%2546%2525%2532%2535%2537%2536%2525%2532%2535%2536%2531%2525%2532%2535%2537%2533%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2543%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2545%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2533%2525%2532%2535%2536%2539%2525%2532%2535%2537%2536%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2535%2525%2532%2535%2536%2545%2525%2532%2535%2536%2539%2525%2532%2535%2537%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2533%2525%2532%2535%2537%2536%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%253
6%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2537%2530%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2536%2546%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2536%2541%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2533%2525%2532%2535%2533%2539%2525%2532%2535%2533%2533%2525%2532%2535%2533%2535%2525%2532%2535%2533%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2538%2525%2532%2535%2533%2546%2525%2532%2535%2536%2539%2525%2532%2535%2536%2534%2525%2532%2535%2533%2544%2525%2532%2535%2533%2538%2525%2532%2535%2536%2534%2525%2532%2535%2533%2534%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2539%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2532%2544%2525%2532%2535%2536%2535%2525%2532%2535%2536%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2539%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2531%2525%2532%2535%2536%2535%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%2532%2535%2536%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2535%2525%2532%2535%2532%2544%2525%2532%2535%2533%2534%2525%2532%2535%2533%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2531%2525%2532%2535%2533%2538%2525%2532%2535%2536%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2534, jobs.cms24-7.com, com, cms24-7, jobs, , , com, cms24-7.com, com, cms24-7.com, https, , 
/job/registered-nurse-rn-medical-surgical-ms-registered-nurse-rn-elizabeth-nj-nj-15364401/381149d2-1866-11eb-bf99-42010a8a003a, listingUrl=%2568%2574%2574%2570%253A%252F%252F%256A%256F%2562%2573%252E%2563%256D%2573%2532%2534%252D%2537%252E%2563%256F%256D%252F%256A%256F%2562%252F%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2569%256E%2574%2565%256E%2573%2569%2576%2565%252D%2563%2561%2572%2565%252D%2575%256E%2569%2574%252D%2569%2563%2575%252D%2572%2565%2567%2569%2573%2574%2565%2572%2565%2564%252D%256E%2575%2572%2573%2565%252D%2572%256E%252D%2570%2565%256E%256E%2569%256E%2567%2574%256F%256E%252D%256E%256A%252D%256E%256A%252D%2531%2533%2537%2537%2531%2530%2532%2535%252F%2531%2538%2533%2562%2531%2539%2532%2564%252D%2565%2562%2538%2537%252D%2531%2531%2565%2561%252D%2562%2536%2531%2565%252D%2534%2532%2530%2531%2530%2561%2538%2561%2530%2566%2566%2534%253F%256C%2569%2573%2574%2569%256E%2567%2555%2572%256C%253D%2525%2532%2535%2536%2538%2525%2532%2535%2537%2534%2525%2532%2535%2537%2534%2525%2532%2535%2537%2530%2525%2532%2535%2533%2541%2525%2532%2535%2532%2546%2525%2532%2535%2532%2546%2525%2532%2535%2536%2541%2525%2532%2535%2536%2546%2525%2532%2535%2536%2532%2525%2532%2535%2537%2533%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2544%2525%2532%2535%2537%2533%2525%2532%2535%2533%2532%2525%2532%2535%2533%2534%2525%2532%2535%2532%2544%2525%2532%2535%2533%2537%2525%2532%2535%2532%2545%2525%2532%2535%2536%2533%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2532%2546%2525%2532%2535%2533%2533%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2536%2536%2525%2532%2535%2533%2531%2525%2532%2535%2537%2533%2525%2532%2535%2532%2546%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%2536%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%
2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2534%2525%2532%2535%2536%2539%2525%2532%2535%2536%2546%2525%2532%2535%2537%2536%2525%2532%2535%2536%2531%2525%2532%2535%2537%2533%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2543%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2545%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2533%2525%2532%2535%2536%2539%2525%2532%2535%2537%2536%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2536%2533%2525%2532%2535%2536%2531%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2535%2525%2532%2535%2536%2545%2525%2532%2535%2536%2539%2525%2532%2535%2537%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2536%2533%2525%2532%2535%2537%2536%2525%2532%2535%2536%2539%2525%2532%2535%2536%2533%2525%2532%2535%2537%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2537%2525%2532%2535%2536%2539%2525%2532%2535%2537%2533%2525%2532%2535%2537%2534%2525%2532%2535%2536%2535%2525%2532%2535%2537%2532%2525%2532%2535%2536%2535%2525%2532%2535%2536%2534%2525%2532%2535%2532%2544%2525%2532%2535%2536%2545%2525%2532%2535%2537%2535%2525%2532%2535%2537%2532%2525%2532%2535%2537%2533%2525%2532%2535%2536%2535%2525%2532%2535%2532%2544%2525%2532%2535%2537%2532%2525%2532%2535%2536%2545%2525%2532%2535%2532%2544%2525%2532%2535%2537%2530%2525%2532%2535%2536%2546%2525%2532%2535%2536%2544%2525%2532%2535%2536%2546%2525%2532%2535%2536%2545%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%
2532%2535%2536%2545%2525%2532%2535%2536%2541%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2533%2525%2532%2535%2533%2539%2525%2532%2535%2533%2533%2525%2532%2535%2533%2535%2525%2532%2535%2533%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2538%2525%2532%2535%2533%2546%2525%2532%2535%2536%2539%2525%2532%2535%2536%2534%2525%2532%2535%2533%2544%2525%2532%2535%2533%2538%2525%2532%2535%2536%2534%2525%2532%2535%2533%2534%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2539%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2532%2544%2525%2532%2535%2536%2535%2525%2532%2535%2536%2535%2525%2532%2535%2533%2532%2525%2532%2535%2533%2539%2525%2532%2535%2532%2544%2525%2532%2535%2533%2531%2525%2532%2535%2533%2531%2525%2532%2535%2536%2535%2525%2532%2535%2536%2531%2525%2532%2535%2532%2544%2525%2532%2535%2536%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2535%2525%2532%2535%2532%2544%2525%2532%2535%2533%2534%2525%2532%2535%2533%2532%2525%2532%2535%2533%2530%2525%2532%2535%2533%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2531%2525%2532%2535%2533%2538%2525%2532%2535%2536%2531%2525%2532%2535%2533%2530%2525%2532%2535%2536%2536%2525%2532%2535%2536%2536%2525%2532%2535%2533%2534, 2021-01-15T13:51:04.000-06:00, 200, , OJHZNR6QGYFDJKBWIJVZPEGKTHZXAKLG, text/html, text/html, UTF-8, eng, , crawl-data/CC-MAIN-2021-04/segments/1610703496947.2/warc/CC-MAIN-20210115194851-20210115224851-00331.warc.gz, 414510034, 46855, 1610703496947.2, CC-MAIN-2021-04, warc]
    at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:916)
    at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:706)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:82)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:139)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.StringIndexOutOfBoundsException: offset 0, count 5292, length 4096
    at java.base/java.lang.String.checkBoundsOffCount(String.java:3304)
    at java.base/java.lang.String.getChars(String.java:855)
    at com.univocity.parsers.common.input.DefaultCharAppender.append(DefaultCharAppender.java:240)
    at com.univocity.parsers.common.input.ExpandingCharAppender.append(ExpandingCharAppender.java:193)
    at com.univocity.parsers.csv.CsvWriter.append(CsvWriter.java:296)
    at com.univocity.parsers.csv.CsvWriter.processRow(CsvWriter.java:191)
    at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:316)

This error occurred when trying to read all CSVs for a given year from the filtered index, separate the rows by month, and write them to multiple folders, one named for each month.

Fixed when running through EMR.
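For readers who want to see the shape of the month-splitting step itself, here is a minimal sketch using only the Python standard library rather than Spark. It does not reproduce or explain the `StringIndexOutOfBoundsException` above. The column name `fetch_time`, the output file name `part.csv`, the helper name `split_by_month`, and the sample rows are assumptions for illustration. The timestamp format mirrors the `2021-01-15T13:51:04.000-06:00` value visible in the error record.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_month(rows, out_dir, time_field="fetch_time"):
    """Group rows by the YYYY-MM prefix of an ISO-8601 timestamp and
    write each group to <out_dir>/<YYYY-MM>/part.csv."""
    by_month = defaultdict(list)
    for row in rows:
        month = row[time_field][:7]  # e.g. "2021-01" from "2021-01-15T..."
        by_month[month].append(row)

    for month, month_rows in by_month.items():
        folder = Path(out_dir) / month
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "part.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(month_rows[0].keys()))
            writer.writeheader()
            writer.writerows(month_rows)

    return {month: len(month_rows) for month, month_rows in by_month.items()}


# Hypothetical index rows, for illustration only.
sample_rows = [
    {"url": "https://example.com/job/1", "fetch_time": "2021-01-15T13:51:04.000-06:00"},
    {"url": "https://example.com/job/2", "fetch_time": "2021-01-20T08:00:00.000-06:00"},
    {"url": "https://example.com/job/3", "fetch_time": "2021-02-02T09:30:00.000-06:00"},
]

with tempfile.TemporaryDirectory() as tmp:
    written = split_by_month(sample_rows, tmp)
    folders = sorted(p.name for p in Path(tmp).iterdir())

print(written, folders)
```

In Spark itself, deriving a month column and writing with `DataFrameWriter.partitionBy` gives a similar folder-per-month layout.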

AngryManlet commented 3 years ago

2020/2021 job postings partitioned by month

vinceecws commented 3 years ago

We tested on the CC-MAIN-2021-10 crawl to generate queries that answer the following question, but by monthly count:

Is there a significant spike in tech job postings at the end of business quarters? If so, which quarter spikes the most?

Based on that test, we determined that including all crawls from the beginning of 2020 to the present is feasible.

So, any postings from this point forward refer to the following crawls:

CC-MAIN-2020-05 
CC-MAIN-2020-10 
CC-MAIN-2020-16 
CC-MAIN-2020-24 
CC-MAIN-2020-29 
CC-MAIN-2020-34 
CC-MAIN-2020-40 
CC-MAIN-2020-45 
CC-MAIN-2020-50 
CC-MAIN-2021-04 
CC-MAIN-2021-10 
CC-MAIN-2021-17 
CC-MAIN-2021-21 
CC-MAIN-2021-25 
CC-MAIN-2021-31
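
As a rough aid for relating these crawls to business quarters, the sketch below maps each crawl ID to a quarter. It assumes the numeric suffix of a crawl ID is the ISO week of the year in which the crawl ran. That is an assumption here, and a crawl can span several weeks, so the mapping is only approximate. A real analysis would bucket by each record's fetch time. The helper name `crawl_quarter` is hypothetical.

```python
from collections import Counter
from datetime import date

CRAWLS = [
    "CC-MAIN-2020-05", "CC-MAIN-2020-10", "CC-MAIN-2020-16",
    "CC-MAIN-2020-24", "CC-MAIN-2020-29", "CC-MAIN-2020-34",
    "CC-MAIN-2020-40", "CC-MAIN-2020-45", "CC-MAIN-2020-50",
    "CC-MAIN-2021-04", "CC-MAIN-2021-10", "CC-MAIN-2021-17",
    "CC-MAIN-2021-21", "CC-MAIN-2021-25", "CC-MAIN-2021-31",
]


def crawl_quarter(crawl_id: str) -> str:
    """Approximate quarter for a crawl ID of the form CC-MAIN-YYYY-WW,
    taking WW as an ISO week number (an assumption)."""
    _, _, year, week = crawl_id.split("-")
    monday = date.fromisocalendar(int(year), int(week), 1)  # Python 3.8+
    quarter = (monday.month - 1) // 3 + 1
    return f"{monday.year}-Q{quarter}"


# Number of crawls that fall in each approximate quarter.
per_quarter = Counter(crawl_quarter(c) for c in CRAWLS)
for quarter in sorted(per_quarter):
    print(quarter, per_quarter[quarter])
```

Because quarters do not contain the same number of crawls, raw counts per quarter may need to be normalized by the number of crawls before comparing spikes.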
AngryManlet commented 3 years ago

Data visualizations created for the queries, separated by quarter and then by month.

mtorres1127 commented 3 years ago

Below are two graphs created in Tableau from the URL data pulled from the crawls. They are divided into quarters and broken down by programming language. [Graphs: 2020, 2021]

mtorres1127 commented 3 years ago

Languages vs. time in months. [Graph: line]