OSU-HPCC / hpc-novice-first-draft

Repo for passing back and forth ideas before pushing them up to the SWC site.

Twitter Conversation #1

Open pdoehle opened 8 years ago

pdoehle commented 8 years ago

We can start a conversation about the scripts here. @mhaffner

mhaffner commented 8 years ago

@pdoehle I decided to go ultra simple with the scripts here. Instead of removing the null byte characters, parsing the tweets, and removing duplicates, I decided to just use the python parsing script. With such a small raw data file, the other parts were not necessary.

Having said that, there is still a lot of information in the parsing script - it extracts 24 variables. Some of this may need to be trimmed down. In fact, it might be a good idea to remove all of the Twitter handles from the raw file for the sake of privacy. It is public information and all of these accounts are public, but if a user deletes a tweet, people like me still have it, and Twitter even removes all location information from tweets after 10 or 14 days (at least they used to).

If you have questions, let me know.

pdoehle commented 8 years ago

@mhaffner I think you're right, simpler is better. I'm looking through the script trying to get a sense of what it does and there is some new syntax that I haven't seen before. Can you explain the line if __name__ == "__main__": to me? Also, I think erring on the side of privacy would be better.

pdoehle commented 8 years ago

@mhaffner On a completely different note, we need a creative name for our fictional graduate student that we will be following through the lesson. If you think of something fun let me know.

mhaffner commented 8 years ago

Sure. The if __name__ == "__main__": statement is not really necessary for this type of exercise, but it's good coding practice to set it up this way. Since you can import any Python script from any other Python script with import somescript, the if __name__ == "__main__" statement ensures that an import only defines the functions, and the freestanding code placed under the statement is not executed when the script is used this way.

If you simply run the script from an IDE or execute it from a terminal, __name__ == "__main__" is true; when the script is imported, it is false. Note that from somescript import somefunction does not get around this: Python still executes the whole module on import, so the statement is what keeps the freestanding code from running. Does that make sense?
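A minimal sketch of the pattern, assuming a hypothetical file named somescript.py. The function count_words is a placeholder for illustration and is not taken from the parsing script:

```python
def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())


def main():
    # Freestanding work lives here so that an import does not trigger it.
    print(count_words("only runs when executed directly"))


if __name__ == "__main__":
    # True when run as `python somescript.py`.
    # False under `import somescript`, where __name__ is "somescript".
    main()
```

With this layout, another script can do import somescript and call somescript.count_words(...) without main() running.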

mhaffner commented 8 years ago

I totally forgot to mention that the script should be called with the raw file as the argument. It should look like python parse_tweets.py parsed-tweets.csv.

It does look horribly inconsistent to have underscores and dashes separating words in the names of files and scripts, so feel free to change this if you want. I use dashes wherever possible, basically in everything but Python scripts, since typing a dash in the shell is faster. Python scripts are the exception because a file with a dash in its name can't be imported with a normal import statement.
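To illustrate the dash problem, here is a small sketch that only checks whether an import statement is valid syntax; it never actually imports anything. The helper importable_by_statement is made up for this example:

```python
def importable_by_statement(module_name):
    """Return True if `import <module_name>` is valid Python syntax."""
    try:
        # compile() only parses the statement; nothing is imported.
        compile(f"import {module_name}", "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(importable_by_statement("parse_tweets"))  # True
print(importable_by_statement("parse-tweets"))  # False: the dash parses as a minus sign
```

A dashed file can still be run directly with python parse-tweets.py, and importlib.import_module("parse-tweets") should work as a workaround, but underscores avoid the issue entirely.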

pdoehle commented 8 years ago

That's helpful, I was actually wondering about that for my master's creative component. if __name__ == "__main__": will be really helpful.

pdoehle commented 8 years ago

I seem to remember you mentioning this. Is the raw Twitter data in JSON format?

pdoehle commented 8 years ago

Another quick question: the two inputs into the function are cleaned_tweet_file and parsed_tweet_file. From what I see, you are taking the raw Twitter text, parsing it, and putting it in a nice table. Is there some sort of processing of the tweets that happens before that?

mhaffner commented 8 years ago

Good questions. The raw Twitter data is in JSON format saved in a .csv. Essentially it's a one column .csv with all the JSON in that one column. It's really probably not the best way to do things, but in the beginning I didn't have any idea what I was doing.
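As a rough sketch of what reading that layout could look like, assuming one JSON object per row in the first column and no header row. The sample data and the field names id and text are invented for illustration; the real file may differ:

```python
import csv
import io
import json

# Hypothetical sample: one column, each cell holding one tweet as JSON.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow([json.dumps({"id": 1, "text": "hello, world"})])
writer.writerow([json.dumps({"id": 2, "text": "second tweet"})])
sample.seek(0)


def read_tweets(handle):
    """Yield one dict per row, decoding the JSON held in the first column."""
    for row in csv.reader(handle):
        if row:  # skip blank lines
            yield json.loads(row[0])


tweets = list(read_tweets(sample))
print(tweets[0]["text"])
```

The csv module handles the quoting, so commas and quotation marks inside the JSON do not split the cell.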

On your second question - yes, there are actually two inputs. The first is raw_tweet_file; it was originally named cleaned_tweet_file, but I have renamed it. And yes, there was originally some processing before getting to this point, but it's not necessary with such a small dataset.

The parsed_tweet_file is the path of the file of parsed tweets (created by the script), named whatever you would like. I included the file parsed-tweets.csv just to show what the tweets should look like after they are parsed. You are right - essentially I'm putting this info into a table format for easy ingestion into an RDBMS (Postgres).

mhaffner commented 8 years ago

Just for clarification, I execute this from the terminal like this:

python parse_tweets.py raw-tweets.csv whatever-name-i-want.csv
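For anyone reading along, here is a hedged sketch of how a script could accept those two arguments. It is not the actual parse_tweets.py: the FIELDS list and the function body are assumptions, and the real script extracts 24 variables rather than two.

```python
import csv
import json
import sys

# Hypothetical subset of fields; the real script extracts 24 variables.
FIELDS = ["id", "text"]


def parse_tweets(raw_tweet_file, parsed_tweet_file):
    """Read JSON tweets from a one-column CSV and write them out as a table.

    Returns the number of tweets written.
    """
    count = 0
    with open(raw_tweet_file, newline="") as raw, \
            open(parsed_tweet_file, "w", newline="") as parsed:
        writer = csv.writer(parsed)
        writer.writerow(FIELDS)  # header row
        for row in csv.reader(raw):
            if not row:  # skip blank lines
                continue
            tweet = json.loads(row[0])
            # Missing fields become empty strings.
            writer.writerow([tweet.get(field, "") for field in FIELDS])
            count += 1
    return count


if __name__ == "__main__":
    if len(sys.argv) == 3:
        # sys.argv[1] is the raw file, sys.argv[2] is the output path.
        parse_tweets(sys.argv[1], sys.argv[2])
    else:
        print("usage: python parse_tweets.py RAW_FILE PARSED_FILE")
```

The second argument can be any path; the file is created (or overwritten) when the script runs.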