Closed GabrielMichaelKlein closed 3 years ago
Schema of the filteredIndex should retain the original schema as defined here: https://github.com/1086-Maria-Big-Data/JobAdAnalytics/blob/883b3543dfff175fc3073b318236280291b392e3/src/main/scala/cc/idx/indexUtil.scala#L22-L54
@GabrielMichaelKlein could you clarify how/why it should be saved as Parquet files? It's currently partitioned into multiple (24, I think) .csv files for read/write concurrency.
Include all columns when filtering WARC files for more flexibility. Save as Parquet files.
My mistake; I honestly don't know why I included that. I think I just meant to keep it as partitioned CSV files.
I'm working on debugging the regex now, but I'll make this change once I figure that out, and then I'll rerun the process to write the CSVs. I think it should be pretty simple?
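For what it's worth, the path regex can be exercised outside Spark while debugging, since `rlike` uses Java regex semantics. A small sketch (the test URL and the two placeholder terms standing in for `techJobTerms` are made up for illustration):

```scala
// Placeholder terms for illustration only; the real list lives in the repo.
val techJobTerms = Seq("software", "developer")

// Same pattern passed to rlike: two lookaheads, one for a job/career path
// segment and one for a tech term, case-insensitive.
val pattern = "(?i)^(?=.*(jobs\\.|careers\\.|/jobs?/|/careers?/))(?=.*(" +
  techJobTerms.mkString("|") + ")).*$"

// Java's String.matches anchors the whole string, like rlike with ^...$.
println("/jobs/software-engineer".matches(pattern))
```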
Originally I did:

```scala
val df = IndexUtil.load(spark).repartition(initial_partition)

df.select("url_surtkey", "fetch_time", "url", "content_mime_type", "fetch_status", "content_digest", "fetch_redirect", "warc_segment", "warc_record_length", "warc_record_offset", "warc_filename")
  .where(col("crawl") === "CC-MAIN-2021-10" && col("subset") === "warc")
  .where(col("fetch_status") === 200)
  .where(col("content_languages") === "eng")
  .where(col("content_mime_type") === "text/html")
  .where(col("url_host_tld") === "com")
  .where(col("url_path").rlike("(?i)^(?=.*(jobs\\.|careers\\.|/jobs?/|/careers?/))(?=.*(" + techJobTerms.mkString("|") + ")).*$"))
  .cache // cache once after all filters; mid-chain .cache calls were redundant
```
IndexUtil.load creates a DataFrame with the correct structure, right? So I just need to take out the .select and it should give us the right CSV structure.
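A quick way to sanity-check that is to compare schemas, since `.where` never adds or drops columns. This is a sketch only; `IndexUtil.load` and `spark` are assumed from the repo:

```scala
import org.apache.spark.sql.functions.col

// With the .select removed, the filtered frame should expose exactly the
// columns of the raw index load, because filters don't change the schema.
val raw = IndexUtil.load(spark)
val filtered = raw
  .where(col("crawl") === "CC-MAIN-2021-10" && col("subset") === "warc")

assert(filtered.schema == raw.schema)
```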
@Ahimsaka Yep, that looks good to me. Also, we do need some folder structure for the results if we're planning to extend this to all crawls. I suggest we structure the folders as follows:
```
/filtered_index
    /CC-MAIN-2020-34
        /part-00000-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00001-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00002-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00003-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        ...
    /CC-MAIN-2020-40
        /part-00000-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00001-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00002-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00003-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        ...
    /CC-MAIN-2020-45
        /part-00000-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00001-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00002-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00003-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        ...
```
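Spark can produce a per-crawl layout like this on its own via `partitionBy`, with the caveat that it names the subfolders `crawl=CC-MAIN-2020-34` rather than the bare crawl id shown above. A sketch, assuming `df` is the filtered index covering several crawls and still containing the `crawl` column (the output path is hypothetical):

```scala
// One subfolder per distinct crawl value; the crawl column itself is
// moved into the directory name rather than written into each CSV.
df.write
  .option("header", "true")
  .partitionBy("crawl")
  .csv("s3://<bucket>/filtered_index") // hypothetical output location
```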