abjer / sds2019

Social Data Science 2019 - a summer school course
https://abjer.github.io/sds2019

How should I analyze the Log? #41

Open · snorreralund opened this issue 5 years ago

snorreralund commented 5 years ago

The objective of analyzing the log is to document data quality. This means being transparent about your data collection. Analytically, you look for signs of potentially systematic missing data (certain error codes concentrated in parts of the scrape, holes in the time series indicating an error in the scraping program) and for artifacts (suspiciously similar response sizes or suspiciously short responses).

  1. Analyze systematic connection errors / error codes and systematically missing data (see the first sketch after this list).

    • Plot the number of error codes over time, to see whether missing answers follow a systematic pattern.
    • Plot the number of error codes across different subsections of your scrape (e.g. cnn.com/health vs. cnn.com/business), to see whether missing answers cluster in particular subsections.
    • Plot the time until response (the delta_t column) over time, to see whether server response times are changing, which can indicate potential problems.
  2. Look for artifacts and potential signs of different HTML formatting (see the second sketch after this list). Systematically different formatting of the HTML will probably force you to design two or more separate parsing procedures.

    • Plot the size distribution (length of the HTML/JSON response), e.g. a histogram or sns.distplot, to look for potential artifacts and errors (unexpectedly small responses, standard responses with the exact same length).
    • Plot the size of the response over time, or in relation to specific subsections (e.g. cnn.com/health or cnn.com/business), to look for potential formatting issues or errors in particular subsections.
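
To make step 1 concrete, here is a minimal sketch in pandas/matplotlib. The file name scraping_log.csv and the columns timestamp, response_code, and url are assumptions for illustration (only delta_t is named above); adapt them to your own log format.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical log file and column names (only delta_t appears above).
log = pd.read_csv("scraping_log.csv", parse_dates=["timestamp"])

# 1a. Error codes over time: hourly count of non-200 responses.
errors = log[log["response_code"] != 200]
errors.set_index("timestamp").resample("1H")["response_code"].count().plot()
plt.ylabel("errors per hour")
plt.show()

# 1b. Error codes per subsection, assuming the section can be parsed from the URL.
log["section"] = log["url"].str.extract(r"cnn\.com/(\w+)", expand=False)
pd.crosstab(log["section"], log["response_code"]).plot(kind="bar", stacked=True)
plt.ylabel("number of responses")
plt.show()

# 1c. Response time (delta_t) over time; rising times can signal server problems.
log.plot(x="timestamp", y="delta_t")
plt.ylabel("response time (seconds)")
plt.show()
```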
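And a sketch of the artifact checks in step 2, under the same assumed log format; response_size is likewise an assumed column name.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Same hypothetical log and columns as in the previous sketch.
log = pd.read_csv("scraping_log.csv", parse_dates=["timestamp"])
log["section"] = log["url"].str.extract(r"cnn\.com/(\w+)", expand=False)

# 2a. Distribution of response sizes: a spike at one exact length or a cluster
# of very small responses is a candidate artifact.
sns.distplot(log["response_size"].dropna())
plt.xlabel("response size (characters)")
plt.show()

# 2b. Response size over time: a sudden level shift can indicate a change in
# the HTML formatting that requires a separate parsing procedure.
log.plot(x="timestamp", y="response_size", style=".", alpha=0.3, legend=False)
plt.ylabel("response size (characters)")
plt.show()

# 2c. Response size per subsection.
log.boxplot(column="response_size", by="section")
plt.show()
```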

If any problems are present, you get the chance to demonstrate a serious attitude towards methodological issues. You should sample the anomalies (breaks in the time series, suspiciously small response lengths, or responses that are too similar, e.g. a standard empty response) and inspect them manually to find the explanation, and report it. If it turns out to be a real issue, think about the potential consequences (if any) for your analysis, and comment on potential causes and explanations, thereby demonstrating strong methodological scraping skills. A sketch of such a sampling procedure follows below.
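As a hedged illustration of that sampling step, again under the assumed log format from the sketches above (the size threshold and sample size are arbitrary choices for illustration):

```python
import pandas as pd

# Same hypothetical log and columns as in the sketches above.
log = pd.read_csv("scraping_log.csv", parse_dates=["timestamp"])

# Suspiciously small responses (bottom 1% of sizes).
small = log[log["response_size"] < log["response_size"].quantile(0.01)]

# Responses sharing the exact same length, a sign of a standard/empty page.
size_counts = log["response_size"].value_counts()
repeated_sizes = size_counts[size_counts > 10].index
standard = log[log["response_size"].isin(repeated_sizes)]

# Draw a small random sample from each group, open the URLs, and inspect the
# pages manually; report what explains the anomaly.
print(small.sample(min(len(small), 5))["url"])
print(standard.sample(min(len(standard), 5))["url"])
```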

BjornCilleborg commented 5 years ago

If we connect to several different data sources using the connector (with different call ids), should we then present each data source in its own graph? And what about connections we only use once? E.g. we connect to a website once, but to a financial API 200 times.

carolinemariesachmann commented 5 years ago

Hi Snorre, there is a lot of confusion regarding the log. Do you want us to plot all of the figures mentioned above and hand them in, either in the paper or in the appendix? If so, we need to run all of our code again, which means we will extract some newer articles than the ones we use in our analysis. Is this a problem, or should we keep our analysis as it is and make the figures based on the new log?