1dia100mijar / LinkedinScraperCompanies

Program to scrape all the posts in a company profile

Filter Promoted Company ads #1

Open jfcolomer opened 1 year ago

jfcolomer commented 1 year ago

Hi there,

Thanks for creating this script, it's fabulous! I was wondering what the best way would be to target not every single post but specifically PROMOTED ADS, i.e. the ones listed at: https://www.linkedin.com/company/{company-name}/posts/?feedView=ads

For some reason, when I update the link variable in the scrape function to something like `link = f'{link}/posts/?feedView=ads'`, it will only pick up the very first promoted ad and won't collect the remaining ones (e.g. out of 50 ads it returns only 1 result), and from that one result it can't collect the likes/links either (e.g. an ad with a carousel of items with links). For all other posts it works like a charm.

Thanks

jfcolomer commented 1 year ago

Hi,

Any help understanding how the individual post items are created before they are passed to `postInfo = getPostInformation(str(post))` would be really appreciated:

```python
def scrape(driver, link, profileType):
    if (profileType == "Company"):
        link = f'{link}/posts/?feedView=ads'
    else:
        link = f'{link}/recent-activity/all/'

    driver.get(link)

    time.sleep(3)

    posts = {}

    old_position = 0
    new_position = None
    counter = 0
    while new_position != old_position:
        # Get old scroll position
        old_position = driver.execute_script(
                ("return (window.pageYOffset !== undefined) ?"
                 " window.pageYOffset : (document.documentElement ||"
                 " document.body.parentNode || document.body);"))
        time.sleep(1)  # Try removing this sleep to see if the program runs
                       # faster; since there is processing between scrolls it
                       # may not be needed here, unlike on Instagram where the
                       # loop only scrolled with no processing in between.
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        soup = str(soup)

        results = soup.split('occludable-update')

        for result in results:
            try:
                counter += 1

                postlink = result.split('data-urn="')[counter].split('"')[0]
                postlink = f'https://www.linkedin.com/feed/update/{postlink}'
            except:
                postlink = ''

            if ('linkedin' in postlink):
                posts[postlink] = result
        new_position = scroll(driver, old_position)

    print(f'\n\nFound {len(posts)} posts.')
    postsFiltered = []

    for postlink, post in posts.items():
        postInfo = getPostInformation(str(post))
        postInfo.append(postlink)
        postsFiltered.append(postInfo)
```
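To make the question concrete: as the snippet stands, each `post` value is just a raw substring of the page HTML, produced by splitting `str(soup)` on the marker `occludable-update`; the dict key is the feed URL reconstructed from the `data-urn` attribute. A toy illustration of that inner loop, using invented miniature HTML (the real LinkedIn page source will of course differ):

```python
# Invented miniature of the page source, for illustration only.
soup_str = (
    '<header>company page</header>'
    '<div data-urn="urn:li:activity:111" class="occludable-update">first ad</div>'
    '<div data-urn="urn:li:activity:222" class="occludable-update">second ad</div>'
)

posts = {}
counter = 0
for result in soup_str.split('occludable-update'):
    try:
        counter += 1
        # Same indexing scheme as the original script: the global counter
        # is used as an index into this chunk's data-urn occurrences.
        postlink = result.split('data-urn="')[counter].split('"')[0]
        postlink = f'https://www.linkedin.com/feed/update/{postlink}'
    except IndexError:
        postlink = ''
    if 'linkedin' in postlink:
        # The `post` later passed to getPostInformation is this raw chunk.
        posts[postlink] = result

print(list(posts))  # only the activity:111 link survives in this toy
```

Interestingly, with this invented HTML only the first chunk survives, because each chunk contains a single `data-urn="` occurrence while `counter` keeps growing, so every index after the first raises `IndexError`. I can't say whether that is exactly what happens on the real ads feed, but it looks a lot like the symptom described above.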

After refactoring the link variable to `link = f'{link}/posts/?feedView=ads'`, I can get the script to export all the company's promoted posts to the CSV in this format:

https://www.linkedin.com/feed/update/urn:li:activity:00000000000000001
https://www.linkedin.com/feed/update/urn:li:activity:00000000000000002
https://www.linkedin.com/feed/update/urn:li:activity:00000000000000003

and so on ...

But the description, hashtags, etc. will only return values for the first of the posts, in this case https://www.linkedin.com/feed/update/urn:li:activity:00000000000000001, so it would be really appreciated if you could explain how the post variable referenced here https://github.com/1dia100mijar/LinkedinScraperCompanies/blob/8365a6e2ea9cc721fbdbf2341d8b198b06c3289e/linkdin.py#L50 is generated.
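I can't speak for the author, but from reading the snippet one guess is that the shared `counter` index is what breaks on the ads feed: it grows across chunks and across scroll iterations, so `result.split('data-urn="')[counter]` only lands on a valid index for the first chunk. A hedged sketch of a per-chunk extraction that always takes the first `data-urn` occurrence instead (the HTML string here is invented for illustration, and this may or may not be the actual cause):

```python
# Hypothetical helper: extract the first data-urn from one chunk,
# with no dependency on any global counter.
def extract_postlink(result):
    parts = result.split('data-urn="')
    if len(parts) < 2:
        return ''  # chunk carries no urn (e.g. page header or trailing HTML)
    urn = parts[1].split('"')[0]
    return f'https://www.linkedin.com/feed/update/{urn}'

# Toy page source, invented for illustration only.
page = (
    '<div data-urn="urn:li:activity:1" class="occludable-update">post one</div>'
    '<div data-urn="urn:li:activity:2" class="occludable-update">post two</div>'
)

links = [extract_postlink(chunk) for chunk in page.split('occludable-update')]
print(links)  # both activity URLs, plus '' for the urn-less trailing chunk
```

If the real ads-feed HTML keeps one `data-urn` per `occludable-update` container, something along these lines should collect every ad rather than only the first one, but that would need checking against the actual page source.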

Thanks