GiveToken / GiftBox

Repository for Sizzle

[810] LinkedIn company scraping feature #1088

Closed by shreydesai 8 years ago

shreydesai commented 8 years ago

Version 0.1

/

/ajax

/js

wogsland commented 8 years ago

1) S!zzle doesn't work.
2) Why the toast instead of just presenting the error directly on the form like, for example, the video section?
3) I don't see ajax/venv.
4) linkedin-scraper.php is scarily insecure. Never use shell_exec, exec, or passthru with input from GET, POST, or session variables.
5) Unit tests are a good idea.
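
To illustrate point 4, here is a minimal sketch of the general principle: validate user input against an allow-list, then pass it to the subprocess as a list element so that no shell ever parses it. It is written in Python rather than the PHP of linkedin-scraper.php (in PHP the analogous steps would be validating the value and quoting it with escapeshellarg()). The helper names, the allowed hosts, the /company/ path pattern, and the idea that ajax/scraper/linkedin.py takes the URL as its only argument are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical allow-list; adjust to whatever hosts the feature should accept.
ALLOWED_HOSTS = {"linkedin.com", "www.linkedin.com"}


def is_safe_company_url(url):
    """Accept only http(s) LinkedIn company URLs with a plain slug."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname not in ALLOWED_HOSTS:
        return False
    # Assumed URL shape: /company/<slug>, slug limited to safe characters.
    return re.fullmatch(r"/company/[A-Za-z0-9_-]+/?", parsed.path) is not None


def build_scraper_command(url):
    """Return an argument list suitable for subprocess.run(..., shell=False)."""
    if not is_safe_company_url(url):
        raise ValueError("refusing to scrape untrusted URL")
    # A list of arguments is never interpreted by a shell, so characters
    # such as ';' or '$(...)' in the URL cannot start another command.
    return ["python3", "ajax/scraper/linkedin.py", url]
```

The resulting list can be handed to subprocess.run() directly; the point is that the request value never reaches a shell command line as a raw string.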

This is a good start!

shreydesai commented 8 years ago

Version 0.2

wogsland commented 8 years ago

When I click submit, the progress animation appears to just go forever and in the console I see the error

SyntaxError: Unexpected token < in JSON at position 0(…)

This appears to happen regardless of whether or not the URL is valid or a LinkedIn one.
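
That message is what JSON.parse reports when the response body begins with `<`, which usually means the endpoint returned HTML (for example a server error page) instead of JSON. A minimal sketch of checking for that case before parsing, so the failure can be reported instead of leaving the progress animation running. It is in Python for illustration only (the real client code is JavaScript), and the function name is hypothetical.

```python
import json


def parse_json_response(body):
    """Parse a response body, failing loudly if it looks like markup."""
    text = body.lstrip()
    if text.startswith("<"):
        # Show the start of the body; it usually names the server-side error.
        raise ValueError("expected JSON but got markup: " + text[:60])
    return json.loads(text)
```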

wogsland commented 8 years ago

When I run the test (is this the correct way?) using

python ajax/scraper/test_linkedin.py

I get the error

SyntaxError: Non-ASCII character '\xe2' in file /Library/WebServer/Documents/GiftBox/ajax/scraper/linkedin.py on line 48, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
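
The `python` command here is presumably Python 2, which assumes ASCII source unless the file declares an encoding (PEP 263). Byte 0xE2 is the lead byte of UTF-8 characters in the U+2000 to U+2FFF range, such as curly quotes and dashes. A small sketch, with hypothetical helper names, for locating the offending lines and checking whether a declaration is present:

```python
import re

# Encoding-declaration pattern adapted from PEP 263; it is only honoured
# on the first or second line of a source file.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def find_non_ascii_lines(data):
    """Return 1-based numbers of lines containing bytes above 0x7F."""
    return [
        number
        for number, line in enumerate(data.splitlines(), start=1)
        if any(byte > 0x7F for byte in line)
    ]


def has_encoding_declaration(data):
    """Check the first two lines for a PEP 263 style declaration."""
    return any(CODING_RE.match(line) for line in data.splitlines()[:2])
```

Both helpers take the raw file contents, e.g. `open(path, "rb").read()`. Running the tests with python3, or adding `# -*- coding: utf-8 -*-` as the first line of linkedin.py, are the usual ways to avoid this error.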

wogsland commented 8 years ago

I created a test file to manually run the steps of run.sh and the trouble seems to be on line 7, where I am getting

Segmentation fault: 11

In fact the python3 command by itself yields this.

shreydesai commented 8 years ago

Version 0.3

wogsland commented 8 years ago

Question - why start a new README rather than adding setup instructions to the existing one? New developers to the project are not likely to look for additional READMEs in subdirectories.

Love that you've started a setup script! This project really needs one.

Bad news, it looks like Python 3.5 isn't available on AWS via yum yet:

$ yum list | grep python3
mod24_wsgi-python34.x86_64          3.5-1.22.amzn1                 amzn-main
python34.x86_64                     3.4.3-1.30.amzn1               amzn-main
python34-devel.x86_64               3.4.3-1.30.amzn1               amzn-main
python34-docs.noarch                3.4.3-1.23.amzn1               amzn-main
python34-libs.i686                  3.4.3-1.30.amzn1               amzn-main
python34-libs.x86_64                3.4.3-1.30.amzn1               amzn-main 
python34-pip.noarch                 6.1.1-1.21.amzn1               amzn-main
python34-setuptools.noarch          12.2-1.30.amzn1                amzn-main
python34-test.x86_64                3.4.3-1.30.amzn1               amzn-main
python34-tools.x86_64               3.4.3-1.30.amzn1               amzn-main
python34-virtualenv.noarch          12.0.7-1.12.amzn1              amzn-main

Can we downgrade to 3.4?

Also, a large number of tracked files in the venv directory now show as modified on my machine after running setup.sh. I deleted the directory and ran setup again: everything was recreated and the same files showed up as different, with the exception of venv/pip-selfcheck.json. Since much of the contents of the virtual environment is machine dependent, should we be tracking venv at all? There seems to be a standard for this: https://github.com/github/gitignore/blob/master/Global/VirtualEnv.gitignore

Attempting to use the linkedin import I am still seeing what appears to be progress forever and an error in the console:

SyntaxError: Unexpected token < in JSON at position 0(…)

I am also still seeing the same error with the test.

shreydesai commented 8 years ago

Version 0.4

wogsland commented 8 years ago

Still seeing a ton of changes in tracked files when I run setup.sh....

Also, you can run python3.4 to specify 3.4 instead of 3.5 (you don't have to delete it off of your system).
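
If the scraper should run on whichever 3.x interpreter the server provides, a guard like the following can fail fast with a clear message when started under an older Python. This is only a sketch: the function name and the (3, 4) minimum are assumptions based on the versions discussed above.

```python
import sys

# Assumed minimum, matching the python34 packages available via yum.
MINIMUM = (3, 4)


def meets_minimum(version_info=None, minimum=MINIMUM):
    """Return True if the (major, minor) version is at least `minimum`."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) >= minimum


if __name__ == "__main__":
    if not meets_minimum():
        sys.exit("Python %d.%d or newer is required" % MINIMUM)
```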

Otherwise it seems to be working great! I'm going to play with it a little more, but it appears to be very close.

wogsland commented 8 years ago

Why not add the linkedin link to the social media section?

wogsland commented 8 years ago

Looks like some cruft is getting left behind in the uploads directory:

-rw-r--r-- 1 _www wheel 327281 Jun 30 12:37 heroImage.png
-rw-r--r-- 1 _www wheel    936 Jun 30 12:37 legacyLogo.png

wogsland commented 8 years ago

Maybe add the instructions to the "Set Up" section of the README rather than the bottom?

wogsland commented 8 years ago

The images are not being replaced. For a consistent UX it probably makes sense to replace the contents of this section rather than add to them. Also, if I run the scrape twice without saving, only the second set of images gets fully saved, even though both sets appear there before saving. (By "fully" I mean that the DB entries are being created for both, but the image files from the first set aren't going into uploads.)

wogsland commented 8 years ago

Also, there should be a warning for the user that "ADD" will replace anything already in the form (whether it came from the DB or from user input).

wogsland commented 8 years ago

The ADD-SUBMIT-CANCEL buttons are a little weird and might be confusing. Why not switch to having just two buttons if the link is found, like SELECT and CANCEL? And you could have SELECT switch back to SUBMIT if they edit the URL again.

shreydesai commented 8 years ago

Version 0.5

*I consulted with a couple of sources before doing this. All the venv is supposed to be is a small environment with project-specific packages that allows the application to run. The venv will have specific information pertaining to the machine it's running on, which is why versioning it doesn't make sense. Also, whenever the application runs, log files, cache files, and CPython files will be auto-generated; these are not important for the project to run, so they shouldn't be versioned.

wogsland commented 8 years ago

To remove the files from being tracked by the git repository without actually removing the files from your computer:

git rm -r --cached ajax/scraper/venv

Putting files in .gitignore has no effect if they're already being tracked in the repository.
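
To see which files are in that state, git can list tracked files that also match an ignore pattern. A minimal sketch wrapping that command in Python; the function name is hypothetical, and it assumes git is on the PATH and that `repo_dir` is inside a working tree.

```python
import subprocess


def tracked_but_ignored(repo_dir):
    """List tracked files that match the repository's ignore rules."""
    result = subprocess.run(
        # --cached: files in the index; --ignored: only those matching an
        # exclude pattern; --exclude-standard: use .gitignore and friends.
        ["git", "ls-files", "--cached", "--ignored", "--exclude-standard"],
        cwd=repo_dir,
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

Anything this returns is a candidate for `git rm --cached`, after which the ignore rules take effect for those paths.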

wogsland commented 8 years ago

I'm seeing errors in the javascript and python tests when I run the build script:

Running JavaScript tests

> Sizzle.IO@0.0.0 test /Library/WebServer/Documents/GiftBox
> cd js && ../node_modules/.bin/mocha --require test/bootstrap.js test

fs.js:634
  return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
                 ^

Error: ENOENT: no such file or directory, open '/Library/Webserver/Documents/GiftBox/js/scraper.js'
    at Error (native)
    at Object.fs.openSync (fs.js:634:18)
    at Object.fs.readFileSync (fs.js:502:33)
    at Suite.<anonymous> (/Library/WebServer/Documents/GiftBox/js/test/test-linkedin-scrape.js:24:17)
    at context.describe.context.context (/Library/WebServer/Documents/GiftBox/node_modules/mocha/lib/interfaces/bdd.js:47:10)
    at Object.<anonymous> (/Library/WebServer/Documents/GiftBox/js/test/test-linkedin-scrape.js:8:1)
    at Module._compile (module.js:541:32)
    at Object.Module._extensions..js (module.js:550:10)
    at Module.load (module.js:458:32)
    at tryModuleLoad (module.js:417:12)
    at Function.Module._load (module.js:409:3)
    at Module.require (module.js:468:17)
    at require (internal/module.js:20:19)
    at /Library/WebServer/Documents/GiftBox/node_modules/mocha/lib/mocha.js:220:27
    at Array.forEach (native)
    at Mocha.loadFiles (/Library/WebServer/Documents/GiftBox/node_modules/mocha/lib/mocha.js:217:14)
    at Mocha.run (/Library/WebServer/Documents/GiftBox/node_modules/mocha/lib/mocha.js:469:10)
    at Object.<anonymous> (/Library/WebServer/Documents/GiftBox/node_modules/mocha/bin/_mocha:404:18)
    at Module._compile (module.js:541:32)
    at Object.Module._extensions..js (module.js:550:10)
    at Module.load (module.js:458:32)
    at tryModuleLoad (module.js:417:12)
    at Function.Module._load (module.js:409:3)
    at Module.runMain (module.js:575:10)
    at run (node.js:348:7)
    at startup (node.js:140:9)
    at node.js:463:3

npm ERR! Darwin 15.5.0
npm ERR! argv "/usr/local/bin/node" "/usr/local/bin/npm" "run" "test"
npm ERR! node v6.2.2
npm ERR! npm  v3.10.3
npm ERR! code ELIFECYCLE
npm ERR! Sizzle.IO@0.0.0 test: `cd js && ../node_modules/.bin/mocha --require test/bootstrap.js test`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the Sizzle.IO@0.0.0 test script 'cd js && ../node_modules/.bin/mocha --require test/bootstrap.js test'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the Sizzle.IO package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     cd js && ../node_modules/.bin/mocha --require test/bootstrap.js test
npm ERR! You can get information on how to open an issue for this project with:
npm ERR!     npm bugs Sizzle.IO
npm ERR! Or if that isn't available, you can get their info via:
npm ERR!     npm owner ls Sizzle.IO
npm ERR! There is likely additional logging output above.

npm ERR! Please include the following file with any support request:
npm ERR!     /Library/WebServer/Documents/GiftBox/npm-debug.log

Running Python scraper tests
EEE
======================================================================
ERROR: test_company_1 (__main__.TestLinkedInScraper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "ajax/scraper/test_linkedin.py", line 9, in setUp
    self.c1 = LinkedInScraper(base_url + "google").get_company_data()
TypeError: get_company_data() missing 1 required positional argument: 'key'

======================================================================
ERROR: test_company_2 (__main__.TestLinkedInScraper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "ajax/scraper/test_linkedin.py", line 9, in setUp
    self.c1 = LinkedInScraper(base_url + "google").get_company_data()
TypeError: get_company_data() missing 1 required positional argument: 'key'

======================================================================
ERROR: test_company_3 (__main__.TestLinkedInScraper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "ajax/scraper/test_linkedin.py", line 9, in setUp
    self.c1 = LinkedInScraper(base_url + "google").get_company_data()
TypeError: get_company_data() missing 1 required positional argument: 'key'

----------------------------------------------------------------------
Ran 3 tests in 0.001s

FAILED (errors=3)
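
All three Python errors come from the same call in setUp: get_company_data() is invoked without the `key` argument that its signature requires. The stand-in class below reproduces the failure and shows one way the call could be updated. The class body, the value of `base_url`, the meaning of `key`, and the value "name" are all hypothetical, since the real implementation in linkedin.py is not shown here.

```python
class LinkedInScraper:
    """Stand-in for the real scraper; returns canned data, no network."""

    def __init__(self, url):
        self.url = url

    def get_company_data(self, key):
        # Hypothetical behaviour: look up one field of the scraped data.
        data = {"name": self.url.rstrip("/").rsplit("/", 1)[-1]}
        return data[key]


base_url = "https://www.linkedin.com/company/"  # assumed value
scraper = LinkedInScraper(base_url + "google")

try:
    scraper.get_company_data()  # the call made in setUp
except TypeError as exc:
    print("setUp-style call fails:", exc)

# Passing the required argument avoids the TypeError.
print(scraper.get_company_data("name"))
```

Either the tests need to pass a key, or get_company_data() needs a default for it; which one is right depends on what the scraper is supposed to return.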

shreydesai commented 8 years ago

Version 0.6

wogsland commented 8 years ago

Warning of replacement really should be in a modal rather than alert(), but everything else looks solid.

[Screenshot, 2016-07-06 10:33:04 AM]

I'm just going to make that #1099 so we can go ahead and include all this great work!!!