1086-Maria-Big-Data / JobAdAnalytics


Include All Columns on Crawl Filtering #72

Closed GabrielMichaelKlein closed 3 years ago

GabrielMichaelKlein commented 3 years ago

Include all columns when filtering WARC files for more flexibility.

vinceecws commented 3 years ago

The filteredIndex should retain the original schema, as defined here: https://github.com/1086-Maria-Big-Data/JobAdAnalytics/blob/883b3543dfff175fc3073b318236280291b392e3/src/main/scala/cc/idx/indexUtil.scala#L22-L54
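
For quick reference, here's a partial sketch of that schema as a Spark StructType. The field names are taken from the columns used elsewhere in this thread; the types are guesses, and the linked indexUtil.scala is authoritative:

```scala
import org.apache.spark.sql.types._

// Partial sketch of the index schema; see the linked indexUtil.scala for the
// authoritative definition. Field types here are assumptions.
val indexSchema = StructType(Seq(
  StructField("url_surtkey", StringType, nullable = true),
  StructField("url", StringType, nullable = true),
  StructField("url_host_tld", StringType, nullable = true),
  StructField("url_path", StringType, nullable = true),
  StructField("fetch_time", TimestampType, nullable = true),
  StructField("fetch_status", ShortType, nullable = true),
  StructField("content_mime_type", StringType, nullable = true),
  StructField("content_languages", StringType, nullable = true),
  StructField("warc_filename", StringType, nullable = true),
  StructField("warc_record_offset", IntegerType, nullable = true),
  StructField("warc_record_length", IntegerType, nullable = true)
  // ...remaining columns (content_digest, fetch_redirect, warc_segment,
  // crawl, subset, etc.)
))
```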

vinceecws commented 3 years ago

@GabrielMichaelKlein could you clarify how/why it should be saved as Parquet files? It's currently partitioned into multiple (24, I think) .csv files for read/write concurrency.

> Include all columns when filtering WARC files for more flexibility. Save as Parquet files.
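
For context, the current partitioned write is along these lines. This is a sketch: the output path and CSV options are assumptions, and `filteredIndex` stands in for the filtered DataFrame:

```scala
// Sketch of the current write: 24 CSV part-files for read/write concurrency.
// The output path and options are assumptions.
filteredIndex
  .repartition(24)
  .write
  .option("header", "true")
  .csv("/filtered_index")
```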

GabrielMichaelKlein commented 3 years ago

My mistake, I honestly don't know why I included that. I think I meant to keep it as partitioned CSV files.

Ahimsaka commented 3 years ago

I'm working on debugging the regex now, but I'll make this change once I figure that out and then rerun the process to write the CSVs. I think it should be pretty simple?

Originally I did:

```scala
import org.apache.spark.sql.functions.col

val df = IndexUtil.load(spark).repartition(initial_partition)

return df
  .select("url_surtkey", "fetch_time", "url", "content_mime_type", "fetch_status", "content_digest", "fetch_redirect", "warc_segment", "warc_record_length", "warc_record_offset", "warc_filename")
  .where(col("crawl") === "CC-MAIN-2021-10" && col("subset") === "warc").cache
  .where(col("fetch_status") === 200).cache
  .where(col("content_languages") === "eng").cache
  .where(col("content_mime_type") === "text/html")
  .where(col("url_host_tld") === "com")
  .where(col("url_path").rlike("(?i)^(?=.*(jobs\\.|careers\\.|/job[s]{0,1}/|/career[s]{0,1}/))(?=.*(" + techJobTerms.mkString("|") + ")).*$"))
```
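
In the meantime, a quick way to sanity-check that rlike pattern against hand-picked paths in the REPL. The term list and sample URLs below are made up for illustration:

```scala
// Exercise the filter regex on sample paths before a full rerun.
// techJobTerms and the sample paths are made-up stand-ins.
val techJobTerms = Seq("software", "developer", "engineer")
val pattern = ("(?i)^(?=.*(jobs\\.|careers\\.|/job[s]{0,1}/|/career[s]{0,1}/))" +
  "(?=.*(" + techJobTerms.mkString("|") + ")).*$").r

Seq(
  "/jobs/software-engineer",    // should match: has /jobs/ and a tech term
  "/careers/developer-intern",  // should match: has /careers/ and a tech term
  "/blog/engineering-culture"   // should not match: no jobs/careers segment
).foreach(p => println(s"$p -> ${pattern.findFirstIn(p).isDefined}"))
```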

IndexUtil.load creates a DataFrame with the correct structure, right? So I just need to take out the .select and it should give us the right CSV structure.
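
If so, the revised version (same imports and helpers as the snippet above) would just be:

```scala
// Same filters as above, minus the .select, so every column in the original
// index schema is kept. A single .cache at the end replaces the intermediate ones.
val df = IndexUtil.load(spark).repartition(initial_partition)

df.where(col("crawl") === "CC-MAIN-2021-10" && col("subset") === "warc")
  .where(col("fetch_status") === 200)
  .where(col("content_languages") === "eng")
  .where(col("content_mime_type") === "text/html")
  .where(col("url_host_tld") === "com")
  .where(col("url_path").rlike(
    "(?i)^(?=.*(jobs\\.|careers\\.|/job[s]{0,1}/|/career[s]{0,1}/))(?=.*(" +
      techJobTerms.mkString("|") + ")).*$"))
  .cache
```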

vinceecws commented 3 years ago

@Ahimsaka Yep, that looks good to me. Also, we do need some folder structure for the results if we're planning to extend this to all crawls. I suggest we structure the folders like this (a sketch of the corresponding write follows the tree):

```
/filtered_index
    /CC-MAIN-2020-34
        /part-00000-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00001-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00002-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00003-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        ...
    /CC-MAIN-2020-40
        /part-00000-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00001-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00002-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00003-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        ...
    /CC-MAIN-2020-45
        /part-00000-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00001-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00002-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        /part-00003-bb6fa4ba-4e14-49d3-985c-e570505dc35d-c000.csv
        ...
```
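
A sketch of how that layout could be produced, filtering and writing per crawl. The crawl list, base path, and partition count are assumptions, and `filteredIndex` stands in for the all-columns DataFrame above:

```scala
import org.apache.spark.sql.functions.col

// Write one folder of CSV part-files per crawl under /filtered_index.
// The crawl list, base path, and partition count are assumptions.
val crawls = Seq("CC-MAIN-2020-34", "CC-MAIN-2020-40", "CC-MAIN-2020-45")

crawls.foreach { crawl =>
  filteredIndex
    .where(col("crawl") === crawl)
    .repartition(24)
    .write
    .option("header", "true")
    .csv(s"/filtered_index/$crawl")
}
```

Alternatively, a single `filteredIndex.write.partitionBy("crawl").csv("/filtered_index")` would give a similar layout in one pass, just with `crawl=CC-MAIN-...` folder names.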