SubjectRefresh / refresh

A Machine Learning Question Generator

Fix subject listings #15

Closed developius closed 9 years ago

developius commented 9 years ago

I think spaces are being removed! [Screenshot: 2015-09-09 at 11:57:15]

popey456963 commented 9 years ago

They are indeed. @OliCallaghan, I'm taking this one, since I wrote the code and understand the problem it was meant to solve.

data = cleanArray($('.emphasized-link').text().replace(/ +?|\r/g, '').split("\n")); is the bit that breaks stuff.
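To see why that pattern strips spaces, here is a minimal sketch on a plain string. The sample text is made up for illustration (the real page content is not shown in this thread), and cleanArray is left out because its implementation is not shown either.

```javascript
// Hypothetical sample of scraped text; not taken from the real page.
const raw = "Business Studies\r\nComputer Science\r\n";

// Original pattern: " +?" matches every single space, so all spaces go,
// including the ones inside subject names.
const original = raw.replace(/ +?|\r/g, "").split("\n");

// Stripping only carriage returns leaves the names intact.
const fixed = raw.replace(/\r/g, "").split("\n");

console.log(original); // [ 'BusinessStudies', 'ComputerScience', '' ]
console.log(fixed);    // [ 'Business Studies', 'Computer Science', '' ]
```

Both versions still leave an empty trailing entry after the split, which is the kind of leftover that then has to be filtered out.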

developius commented 9 years ago

Thanks @popey456963 - when this is fixed I think we're ready for a Beta release!

developius commented 9 years ago

@popey456963 status?

popey456963 commented 9 years ago

Okay, so, what is happening is that I get an ugly array:

[ blarg \r\r\r\r \r\r\r bash]

Before, I was simply stripping those out to get:

[blarg, bash]

What I didn't realise, though, was that it also removes spaces:

Business Studies --> BusinessStudies

To fix this I've removed the line above and now use:

data = cleanArray($('.emphasized-link').text().replace(/\r/g, '').split("\n"));

(Note that the " +?|" part of the pattern is gone.) However, this returns six undefineds for every actual word. To fix this I have two options: doing it client-side or doing it server-side. The client-side method, which I've already shown you, consists of looping through each entry, testing whether it's undefined, and removing it if it is.
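The client-side filtering described here could look roughly like the sketch below. The array contents are hypothetical, and it assumes the unwanted entries are undefined, empty, or whitespace-only strings; the thread does not show the real data.

```javascript
// Hypothetical array shaped like the one described: junk entries
// surround each real subject name.
const received = ["", undefined, "Business Studies", "", "  ", undefined, "Physics"];

// Keep only entries that are strings with some non-whitespace content.
const subjects = received.filter(
  (entry) => typeof entry === "string" && entry.trim() !== ""
);

console.log(subjects); // [ 'Business Studies', 'Physics' ]
```

This still means the client has to download every junk entry before discarding it.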

The server-side method is to detect whether each array entry contains four digits (most likely using a regex) and then add those entries, and only those, to a new array, which is then passed to the callback.
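A minimal sketch of that server-side filter, assuming every real entry carries a four-digit syllabus number. The function name keepSubjects and the sample entries are made up for illustration.

```javascript
// Matches any run of four digits. No "g" flag, so test() keeps no state
// between calls.
const FOUR_DIGITS = /\d{4}/;

// Hypothetical helper: copy only the entries that contain four digits.
function keepSubjects(entries) {
  const result = [];
  for (const entry of entries) {
    if (typeof entry === "string" && FOUR_DIGITS.test(entry)) {
      result.push(entry);
    }
  }
  return result;
}

console.log(keepSubjects(["", "Business Studies 9999", "\t", "Physics 1234"]));
// [ 'Business Studies 9999', 'Physics 1234' ]
```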

I started working on the client-side version, but promptly gave up when informed that its efficiency was so bad (it also means the client receives 6x the data). The server-side solution is coming on nicely, and the regex is short (/\d{4}/). Currently I'm having some problems making my solution efficient and am looking for alternatives to regex, which takes a considerable amount of time on each pass and needs 6 * categories tests for each exam board.

So I'm looking for suggestions for better methods than regex for filtering an array as described above. Any ideas?
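One regex-free alternative is a character-code scan for four consecutive ASCII digits, sketched below. The name hasFourDigits is hypothetical, and whether this actually beats /\d{4}/ would need measuring; it is offered only as an equivalent check.

```javascript
// Returns true if the text contains at least four ASCII digits in a row.
function hasFourDigits(text) {
  let run = 0; // length of the current run of digits
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 48 && code <= 57) { // character codes for "0" to "9"
      run += 1;
      if (run === 4) return true;
    } else {
      run = 0;
    }
  }
  return false;
}

console.log(hasFourDigits("Business Studies 9999")); // true
console.log(hasFourDigits("abc123"));                // false
```

It can be dropped into a filter call, for example entries.filter(hasFourDigits), provided every entry is a string.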

developius commented 9 years ago

@popey456963 I can't think of another way to remove those things from the array... :(

developius commented 9 years ago

@popey456963 status?

popey456963 commented 9 years ago

Oh yeah, I completely forgot about this. I have code that stops the removal of spaces and instead removes array entries based on whether or not they contain four digits. This works great, except that randomly (about 1 in 10, different every time?!) we get an entry that looks like:

\t\n\t\t\t\t\t\nSubject\t\t\t\t9999\t\t\t\t\n

Which doesn't display very well in a drop-down box. I'm currently writing code that removes all \n's and only removes \t's when there is more than one, but I'm running into some trouble. I don't really want to use regex as that is hideously slow, so I'm using something that should hopefully be O(n).
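A single pass over the characters could be sketched as follows. This is an assumption about the desired output rather than the code referred to above: it drops every line break, collapses any run of tabs or spaces into one space, and discards leading and trailing gaps. The name tidyEntry is hypothetical.

```javascript
// Visit each character once; drop line breaks and collapse runs of
// tabs/spaces into a single space, with none at the start or end.
function tidyEntry(text) {
  let out = "";
  let pendingGap = false; // true after seeing a tab or space
  for (const ch of text) {
    if (ch === "\n" || ch === "\r") continue; // drop line breaks entirely
    if (ch === "\t" || ch === " ") {
      pendingGap = true;
      continue;
    }
    // Emit one space for the gap, but never at the very start.
    if (pendingGap && out !== "") out += " ";
    pendingGap = false;
    out += ch;
  }
  return out;
}

console.log(tidyEntry("\t\n\t\t\t\t\t\nSubject\t\t\t\t9999\t\t\t\t\n")); // Subject 9999
```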

developius commented 9 years ago

@popey456963 I think regex is going to be quick enough, so if it's easier just use that. If you're set on not using regex, then I suggest you replace 4 \t's (as long as it's always 4 \t's between the subject name and the syllabus number) with a space and then trim all spaces from the start and all spaces from the end. That should give you what you want.
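The non-regex suggestion can be sketched with split and join, assuming the separator really is always four tabs. The function name fourTabsToSpace is hypothetical. Note that String.prototype.trim removes tabs and newlines as well as spaces, which is what cleans up the ends here.

```javascript
// Replace each run of exactly four tabs with one space, then trim
// whitespace (spaces, tabs, newlines) from both ends.
function fourTabsToSpace(text) {
  return text.split("\t\t\t\t").join(" ").trim();
}

console.log(fourTabsToSpace("Subject\t\t\t\t9999\t\t\t\t\n")); // Subject 9999
```

If the number of tabs between the name and the number varies, stray tabs would be left in the middle of the string, so this relies on that assumption holding.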

This regex, [a-zA-Z](\t{4})\d, almost does what you want, except that (in Python anyway) the match includes the last character of the name and the first character of the number, which is strange because it's using a capture group to avoid exactly that issue...
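The behaviour described here is expected: a capture group only records part of the match, while the overall match (and anything a replace call substitutes) still spans everything the pattern consumed. The sketch below shows this in JavaScript rather than Python, using a made-up sample string, along with one possible workaround that captures the neighbouring characters and writes them back.

```javascript
const sample = "Subject\t\t\t\t9999";

// The pattern quoted above: the group captures only the tabs, but the
// overall match still includes the letter before and the digit after.
const quoted = /[a-zA-Z](\t{4})\d/;
const match = sample.match(quoted);
console.log(JSON.stringify(match[0])); // "t\t\t\t\t9"
console.log(JSON.stringify(match[1])); // "\t\t\t\t"

// Workaround: capture the neighbours and put them back around a space.
const cleaned = sample.replace(/([a-zA-Z])\t{4}(\d)/, "$1 $2");
console.log(cleaned); // Subject 9999
```

Lookahead and lookbehind assertions are another option where the regex engine supports them.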

developius commented 9 years ago

@popey456963 how far did you get with this regarding my previous comment?