adam-broussard / bestreads


Reduce Noise #37

Closed · adam-broussard closed this issue 2 years ago

adam-broussard commented 2 years ago

I think we are going to have to pare down our genres further... I have just finished writing a query function, and I am finding that a huge number of genres have very few associated books, which leads to overfitting. For example, the "Productivity" genre contains only one book; because that book mentions the word "time" more than once, "Productivity" dominates any query that mentions "time". Roughly 100 genres have fewer than 10 associated books.

adam-broussard commented 2 years ago

I should also mention that running the query function with weight_scheme=1 seems to help somewhat in my preliminary testing.

youngant commented 2 years ago

Maybe it's worth making weight_scheme=1 the default?

One thing we could do is just drop all genres with too few books. The figure below suggests 25 books is a good cutoff (there are genres with many more books than the figure shows). This means dropping 119 genres (see the list below; a rough filtering sketch follows the list).

[Figure: distribution of the number of books per genre, suggesting 25 books as a cutoff.]

Dropped genres: ['Christianity', 'Environment', 'Sociology', 'Drama', 'Occult', 'Suspense', 'World War II', 'Shapeshifters', 'Anthologies', 'Games', 'Leadership', 'Mythology', 'Computer Science', 'Anthropology', 'Polyamorous', 'Lds', 'Adult', 'Marriage', 'Relationships', 'Academic', 'Fan Fiction', 'Esoterica', 'Dungeons and Dragons', 'Inspirational', 'Comics', 'Book Club', 'Science Fiction Fantasy', 'Football', 'Couture', 'Apocalyptic', 'Epic', 'Military History', 'Fairy Tales', 'Category Romance', 'Audiobook', 'Gardening', 'Law', 'Asian Literature', 'Family', 'Sexuality', 'Architecture', 'Medical', 'Love', 'Design', 'Combat', 'Space', 'Pseudoscience', 'American', 'Biology', 'Prayer', 'Novella', 'Race', 'Aviation', 'Gothic', 'Unfinished', 'Disability', 'Criticism', 'Modern', 'Textbooks', 'Magical Realism', 'Humanities', 'Action', 'Military', 'Nurses', 'Womens', 'Biography Memoir', 'Central Africa', 'Diary', 'New York', 'Manga', 'Pop Culture', 'Witchcraft', 'Spy Thriller', 'Kids', 'Mental Health', 'Dc Comics', 'Currency', 'Gamebooks', 'Crafts', 'True Story', 'Gender', 'Teaching', 'Speculative Fiction', 'Linguistics', 'Social Issues', 'Computers', 'Romantic', 'Folk Tales', 'World Of Warcraft', 'Fairies', 'Harlequin', 'Pornography', 'Alcohol', 'Female Authors', 'Neuroscience', 'Biblical Fiction', 'Buisness', 'Eastern Africa', 'Roman', 'Futuristic', 'Productivity', 'School Stories', 'Church', 'Gay Romance', 'African Literature', 'Nobel Prize', 'Northern Africa', 'Family Law', 'Menage', 'Own', 'Polyamory', 'North American Hi...', 'Superheroes', 'Nature', 'Political Science', 'Warfare', 'Sci Fi Fantasy', 'Paranormal Urban Fantasy', 'Love Inspired']
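
For concreteness, here's a minimal sketch of that cutoff filter in pandas. It assumes the catalog is a DataFrame with a genre column; the names drop_small_genres, books, and genre are hypothetical, not from our codebase.

```python
import pandas as pd

MIN_BOOKS = 25  # cutoff suggested by the books-per-genre figure above

def drop_small_genres(books: pd.DataFrame, min_books: int = MIN_BOOKS) -> pd.DataFrame:
    """Drop every row whose genre has fewer than min_books associated books."""
    counts = books['genre'].value_counts()      # books per genre
    keep = counts[counts >= min_books].index    # genres that survive the cutoff
    return books[books['genre'].isin(keep)]
```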

adam-broussard commented 2 years ago

That makes sense to me. It would be really nice if we had a way of grouping genres, because a lot of these sound like bigger genres with slightly different phrasing (for example, 'Dc Comics' should fit under 'Comics'). In practice, though, reducing these by hand may not actually improve our dataset all that much.

youngant commented 2 years ago

Agreed. Combining these will still result in pretty small genres. There are probably only ~500 books total across all of these genres.

Also, here's the list of genres that still remain:

'Fantasy', 'Fiction', 'Romance', 'Young Adult', 'Nonfiction', 'Historical', 'Mystery', 'Science Fiction', 'Sequential Art', 'Childrens', 'Classics', 'Horror', 'History', 'Poetry', 'Paranormal', 'Philosophy', 'Thriller', 'Religion', 'Christian Fiction', 'Short Stories', 'Biography', 'Christian', 'New Adult', 'Science', 'Womens Fiction', 'Cultural', 'Business', 'Humor', 'Plays', 'Media Tie In', 'Psychology', 'Autobiography', 'Self Help', 'Animals', 'Sports', 'Erotica', 'Spirituality', 'Art', 'Contemporary', 'Dark', 'Adult Fiction', 'Music', 'Politics', 'Travel', 'Adventure', 'Realistic Fiction', 'Crime', 'LGBT', 'Economics', 'Food and Drink', 'Novels', 'Holiday', 'Westerns', 'Parenting', 'War', 'Urban', 'Language', 'GLBT', 'Feminism', 'Health', 'Literature', 'Amish', 'European Literature', 'Culture', 'Reference', 'Education', 'Writing'

Now that the list isn't so large, we might want to tweak it by hand a bit to clean up the predictions. For example, I think "GLBT" is the same as "LGBT" (but we should double-check). Also, maybe "Fiction" should be renamed to something like "General Fiction" so our predictions look less naive.

adam-broussard commented 2 years ago

Okay, I'm realizing what the issue is... Evidently, genres on Goodreads are not picked from a set of pre-made tags. Instead, users sort their books onto custom shelves, which they can name anything they want. Coherent genres emerge only because many people happen to give their personal shelves the same (or very similar) names. This is why we are ending up with a ridiculous number of subgenres and slightly varying genres; a "genre" can be literally anything a user types. As a result, we may want to add a constraint that a genre assignment requires a sufficient fraction of a book's total shelf votes, or something along those lines. I'll have to think about how to work around it.
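
A rough sketch of what that vote-fraction constraint could look like, assuming each book comes with a mapping from shelf name to vote count (the data layout and the 0.2 threshold are assumptions for illustration):

```python
def assign_genre(shelf_votes, min_fraction=0.2):
    """Return the top-voted shelf as the book's genre only if it holds at
    least min_fraction of the total shelf votes; otherwise return None."""
    total = sum(shelf_votes.values())
    if total == 0:
        return None
    genre, votes = max(shelf_votes.items(), key=lambda kv: kv[1])
    return genre if votes / total >= min_fraction else None

# A book shelved overwhelmingly as 'Fantasy' keeps its genre...
assign_genre({'Fantasy': 50, 'Productivity': 5})        # -> 'Fantasy'
# ...but a book with scattered one-off shelf names gets none.
assign_genre({'Productivity': 1, 'Own': 1, 'Kids': 1,
              'Diary': 1, 'Lists': 1, 'Audiobook': 1})  # -> None
```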

adam-broussard commented 2 years ago

If we want, we could REALLY pare down the genres to about 20 and then manually assign every single unique genre in the dataset to one of those. This would make the classification problem much simpler and less granular. Two starting points could be Goodreads' default genres or Barnes & Noble's most popular subjects list.

[Images: Goodreads' default genre list; Barnes & Noble's most popular subjects list.]

adam-broussard commented 2 years ago

One other thing to keep in mind is that many books have multiple genres that are good descriptors (at least if we aren't operating at a really high level like the lists above). Since we currently base our analysis on the single top genre by votes, we might be able to improve it by somehow incorporating the top N genres instead.

youngant commented 2 years ago

I'm not really sure what we should do. Not every "genre" we have maps cleanly onto one of the categories from Goodreads or Barnes & Noble, and I kind of like the range of genres we have. That being said, I think some manual cleanup would be good. Honestly, I'm pleasantly surprised at how sensible the "genres" are given the way Goodreads generates them.

I'm pretty interested to see how the multi-genre prediction would work. I'd vote that we at least see what those predictions look like before deciding how to further manipulate the list we have.

adam-broussard commented 2 years ago

At least with our current setup, we could modify the current query function to output the top N genres instead of the top 1 - that would be a quick implementation (rough sketch below). I'll also make an issue to create a function that runs on the test data and outputs some results.
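
Something like the following, assuming the query function already produces an array of per-genre similarity scores (the function and variable names here are placeholders, not our actual API):

```python
import numpy as np

def top_n_genres(scores, genre_names, n=3):
    """Return the n genre names with the highest query scores, best first."""
    order = np.argsort(scores)[::-1][:n]   # score indices, descending
    return [genre_names[i] for i in order]

# e.g. top_n_genres(similarity_scores, genre_list, n=3)
```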

youngant commented 2 years ago

I just realized something you may have been referencing earlier, but I didn't fully understand the issue at the time. We're constructing the combined descriptions based on the top genre, so a book with only a plurality (not even a majority) of the votes toward a given genre has its description treated as just as representative of that genre as the description of a book with 100% of its votes for that genre. There might be a way for us to compute the TF-IDF scores using some kind of weighting based on the genre votes, but I'd have to spend some time looking at the equations. It definitely wouldn't be as simple as just combining the text, though.
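
One possible shape for that weighting, sketched with scikit-learn: scale each book's term counts by its vote fraction for each genre before the IDF step, instead of concatenating whole descriptions under a single genre. The function name and the (n_books, n_genres) vote_fractions layout are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def weighted_genre_tfidf(descriptions, vote_fractions):
    """Per-genre TF-IDF where each book contributes to every genre in
    proportion to that genre's share of the book's shelf votes.

    descriptions: list of n_books description strings
    vote_fractions: (n_books, n_genres) array of per-book vote fractions
    """
    vectorizer = CountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform(descriptions)   # (n_books, n_terms), sparse
    genre_counts = (counts.T @ vote_fractions).T      # (n_genres, n_terms), dense
    tfidf = TfidfTransformer().fit_transform(genre_counts)
    return tfidf, vectorizer
```

With all-or-nothing vote fractions (1 for the top genre, 0 elsewhere) this reduces to the current combine-the-text approach, so it would be a strict generalization.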

adam-broussard commented 2 years ago

This is a good point. I was talking about a different (but similar) problem in my comment above (which led to #39): it's difficult to actually measure our success. For example, if we predict the second most voted genre but not the first, was that a success?

More generally, we can predict N genres and compare them against M voted genres from the dataset. If our predicted genres are identical to the voted genres but out of order, how would we measure how close we got? What if some matched both value and position, but others didn't? Fundamentally, no book belongs to just a single genre, so we may need a way to generalize our measure of success so that we can make multiple predictions and still quantify how well we're doing.
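
As one toy example of such a generalized score (purely illustrative; the weights here aren't from any standard metric), credit could be split between getting a genre anywhere in the voted list and getting its rank right:

```python
def genre_overlap_score(predicted, voted):
    """Toy score in [0, 1]: membership in the voted list earns partial
    credit, matching both genre and rank earns full credit."""
    if not predicted:
        return 0.0
    hits = sum(1 for g in predicted if g in voted)               # right genre, any rank
    exact = sum(1 for p, v in zip(predicted, voted) if p == v)   # right genre and rank
    return (hits + 0.5 * exact) / (1.5 * len(predicted))

# Both predictions appear in the voted list; 'Fantasy' also matches rank:
genre_overlap_score(['Fantasy', 'Romance'],
                    ['Fantasy', 'Adventure', 'Romance'])   # -> 0.833...
```

Rank-aware metrics from information retrieval (e.g. NDCG) would be the more principled version of this.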

What you're talking about here is related in that we are assigning each book a single genre in the training stage when we may not have to. I like your idea of spreading the TF-IDF weight amongst the top M genres during training - that seems much more reasonable than the single genre method we're using right now.

youngant commented 2 years ago

Okay, then I understood it the way you meant it. See new issues #42 and #43.