Ljferrer / Ghost

Lil BERT will help you rap!

Script to Aggregate & Clean Dataset #5

Closed Ljferrer closed 5 years ago

Ljferrer commented 5 years ago

Rules:

Copy fields:

Clean lyrics fields:

Ljferrer commented 5 years ago

Examples of dirty data

Over 80 chars:

Ambiguous Headers:

Loving me like me Especially when wife be acting funny Cause it don't need for no money And it can be so lovely Take your hand and aim to please Don't drip a drip on your Dungarees Take a towel and cover your jeans Move it back and forth till you see babies

Repeat 1 (2x)

I'm a jerk, if your a jerk How do it work? Show me how you move it back and forth now baby For the guys and the girls Move it all around now work, oh yeah
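A minimal sketch of how the two categories above might be flagged automatically. The 80-character limit comes from the category name. The `BARE_HEADER` pattern and the `flag_dirty_lines` helper are hypothetical and not taken from the committed formatter; the keyword list is only a guess at what an unbracketed header looks like.

```python
import re

MAX_LINE_CHARS = 80  # limit taken from the "Over 80 chars" category

# Hypothetical pattern: a header-like line without [brackets],
# e.g. "Repeat 1 (2x)", "Chorus", "Hook 2x", "Verse 2:".
BARE_HEADER = re.compile(
    r"^(?:verse|chorus|hook|bridge|intro|outro|repeat)\b[\s\d]*(?:\(?\d+x\)?)?\s*:?$",
    re.IGNORECASE,
)


def flag_dirty_lines(text):
    """Return (line_number, reason, line) tuples for suspicious lines."""
    flags = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if len(stripped) > MAX_LINE_CHARS:
            # Likely several lyric lines flattened into one.
            flags.append((number, "over_80_chars", stripped))
        elif BARE_HEADER.match(stripped):
            # Looks like a section header but has no brackets.
            flags.append((number, "ambiguous_header", stripped))
    return flags


if __name__ == "__main__":
    sample = "Repeat 1 (2x)\nShow me how you move it\n" + "x" * 81
    for flag in flag_dirty_lines(sample):
        print(flag[0], flag[1])
```

Flagging rather than deleting keeps the decision about what to drop separate from detection, which makes it easier to inspect false positives by hand.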

Ljferrer commented 5 years ago

Preliminary Formatter committed in 0bf669710c33cb1ea3426e4efbcd28e651b45110

thammegowda commented 5 years ago

@Ljferrer what do you mean by formatting noise?

Ljferrer commented 5 years ago

[My Considerations]

1) Sometimes verses get clipped because of a \n\n in the middle (denoting a pause). I'm using a double newline as the section separator.
2) Sometimes verse headers do not have [brackets].
3) Sometimes a chorus repeat is denoted with a number, [HOOK 2x], or (Chorus), etc.
4) I can’t always extract just the lyrics.
5) Somehow, a fair bit of skits/interludes/non-poetry got downloaded (maybe 2-5% of the files have something like this).
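To make considerations 1 to 3 concrete, here is a rough sketch of splitting on double newlines and reading bracketed or parenthesized headers with an optional repeat count. The `parse_sections` function and both regular expressions are assumptions for illustration, not the code from the commits referenced in this thread.

```python
import re

# Assumed header shapes: "[Verse 1]", "[HOOK 2x]", "(Chorus)".
HEADER = re.compile(r"^(?:\[(?P<square>[^\]]+)\]|\((?P<round>[^)]+)\))$")
# Assumed repeat marker: a number followed by "x", optionally parenthesized.
REPEAT = re.compile(r"\(?\b(?P<count>\d+)\s*x\b\)?", re.IGNORECASE)


def parse_sections(lyrics):
    """Split on blank lines; return dicts with label, repeat count, and lines."""
    sections = []
    for block in re.split(r"\n\s*\n", lyrics.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        label, repeat = None, 1
        header = HEADER.match(lines[0])
        if header:
            label = (header.group("square") or header.group("round")).strip()
            lines = lines[1:]
            found = REPEAT.search(label)
            if found:
                repeat = int(found.group("count"))
                label = REPEAT.sub("", label).strip()
        sections.append({"label": label, "repeat": repeat, "lines": lines})
    return sections


if __name__ == "__main__":
    text = "[Verse 1]\nline a\nline b\n\nline c\n\n[HOOK 2x]\nhook line\n\n(Chorus)\nla la"
    for section in parse_sections(text):
        print(section)
```

In this sketch a verse clipped by a mid-verse pause shows up as a section whose `label` is `None`, so keeping only labeled sections is one way to keep just the material that was formatted as expected.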

[In Conclusion] I’m just keeping the stuff that was formatted as expected.

[On Second Thought] PS. Re: 4) Maybe I could use spaCy language detection to check whether each sentence scores above an ‘en’ threshold, and treat it as natural language if it does.

PPS. Re: 5) Maybe it doesn’t matter


On Jun 4, 2019, at 6:23 PM, Thamme Gowda notifications@github.com wrote:

@Ljferrer what do you mean by formatting noise?


Ljferrer commented 5 years ago

Tried using spaCy to detect a language threshold for each line. It does not work well with the slang. It can detect the language given the whole document, however.
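The gap between per-line and whole-document scoring can be illustrated with a toy scorer. This is not spaCy and not the detector that was tried here: `en_score` is a hypothetical stand-in that counts common English function words, and the word list and the 0.2 threshold are arbitrary assumptions. It only shows why short, slang-heavy lines give unstable scores while the aggregate stays recognizable.

```python
import re

# Tiny illustrative word list; a stand-in for a real language detector.
EN_STOPWORDS = frozenset(
    "a an and are be but can do for how i if in is it me my no not of on so "
    "the to we when with you your".split()
)


def en_score(text):
    """Fraction of tokens that are common English function words (0.0 if empty)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in EN_STOPWORDS for token in tokens) / len(tokens)


def compare_granularity(lyrics, threshold=0.2):
    """Return (lines scoring below threshold, score of the whole document)."""
    lines = [ln for ln in lyrics.splitlines() if ln.strip()]
    failing = [ln for ln in lines if en_score(ln) < threshold]
    return failing, en_score(lyrics)


if __name__ == "__main__":
    sample = (
        "Show me how you move it\n"
        "Dungarees drip drip\n"
        "Take your hand and aim to please"
    )
    failing, doc_score = compare_granularity(sample)
    print("lines below threshold:", failing)
    print("document score:", round(doc_score, 3))
```

With so few tokens per line, one slang-heavy line can score zero even though the document as a whole clears the threshold, which matches the behavior described above.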

Ljferrer commented 5 years ago

Closed by b0e27a0f7d0770ec25e9019c476bdc970c60606d