EnterpriseCharacteristicsESSnetBigData / StarterKit

GNU General Public License v3.0

URLsFinder find_urls_to_scrape_from_suggested_url results in KeyError #1

Open whoislocalhost opened 3 years ago

whoislocalhost commented 3 years ago

A KeyError appears when I try to execute cell 3.4 (finding URLs to scrape) in the Jupyter notebook file OBEC_Starter_Kit_URLs_Finder.ipynb under the URLsFinder directory. The following method is called:

dfnt = uf.start_ws(
    timeout=20,
    sleep=0.5,
    urlsatstart=10,
    urlsatend=10,
    slice=0,
    url='Suggested URL',
    what='find_urls_to_scrape_from_suggested_url'
)

After the method has loaded the file black_list_urls.csv, the following error message appears:

KeyError                                  Traceback (most recent call last)
<ipython-input-19-168d49c7384c> in <module>
----> 1 dfnt = uf.start_ws(
      2     timeout=20,
      3     sleep=0.5,
      4     urlsatstart=10,
      5     urlsatend=10,

StarterKit-master\src\URLsFinderWS.py in start_ws(self, *args, **kwargs)
    217                 dfns = self.black_list_urls(self.load_files(file='black_list'),
    218                                             slice=x)            
--> 219                 dfnt = self.get_urls_to_scrape(dfns,
    220                                                timeout=timeout,
    221                                                sleep=sleep,

StarterKit-master\src\URLsFinderWS.py in get_urls_to_scrape(self, dfns, *args, **kwargs)
    529             dfe = dfe[list(dfns.columns) + ['Error']]
    530         dfne.drop_duplicates(inplace=True)
--> 531         dfne.drop_duplicates(subset=['ID', 'URL to scrape'], inplace=True)
    532         dfne.reset_index(drop=True, inplace=True)
    533         dfne = dfne[list(dfns.columns)+['URL to scrape']]

lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3028             if is_iterator(key):
   3029                 key = list(key)
-> 3030             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
   3031 
   3032         # take() does not accept boolean indexers

lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1264             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1265 
-> 1266         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1267         return keyarr, indexer
   1268 

lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1306             if missing == len(indexer):
   1307                 axis_name = self.obj._get_axis_name(axis)
-> 1308                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1309 
   1310             ax = self.obj._get_axis(axis)

KeyError: "None of [Index(['ID', 'Name', 'URL', 'Suggested URL', 'Link position',\n       'Has equal domain', 'Has Simple Suggested URL', 'URL to scrape'],\n      dtype='object')] are in the [columns]"

The error indicates that the pandas DataFrame is empty when it is not supposed to be, and this causes an exception that is not handled.
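This failure mode can be reproduced outside the Starter Kit. The sketch below is illustrative only (the names dfne and expected_cols are stand-ins, not the Starter Kit's actual variables): column selection on an empty, column-less DataFrame raises exactly this KeyError, and one possible guard is to return an empty frame with the expected schema instead of crashing.

```python
import pandas as pd

# Stand-in for the scrape result: when no rows were produced, the frame is
# empty and has no columns at all, so selecting the expected column list
# raises KeyError instead of returning an empty result.
dfne = pd.DataFrame()
expected_cols = ['ID', 'Name', 'URL', 'Suggested URL', 'URL to scrape']

try:
    dfne = dfne[expected_cols]  # KeyError: "None of [...] are in the [columns]"
except KeyError as exc:
    print('selection failed:', exc)

# One possible guard: hand back an empty frame with the expected schema
# when the scrape produced nothing, instead of letting the KeyError escape.
if dfne.empty:
    dfne = pd.DataFrame(columns=expected_cols)
else:
    dfne = dfne[expected_cols]

print(list(dfne.columns))
```

Handling the empty case explicitly like this would let the notebook finish with an empty result set rather than aborting mid-cell.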

EnterpriseCharacteristicsESSnetBigData commented 3 years ago

Kosta,

See what problem comes up when testing the Starter Kit.

Apparently the German colleagues have started testing it.

Regards, Galya

EnterpriseCharacteristicsESSnetBigData commented 3 years ago

Dear Anssi,

My colleague Kostadin tried to resolve the problem, but it is not clear how the errors were raised. May I kindly ask you to send us the input files you were using? Thanks.

Best regards, Galya


EnterpriseCharacteristicsESSnetBigData commented 3 years ago

Dear Anssi,

I made changes to the code and hope I fixed the bugs. I tested the code with your data and it worked. Please bear in mind that the fields in your input data have to match the words in URLs_words.txt without the "Has " prefix, otherwise you will get an error in step 3.5 (case matters).

URLs_words.txt:
Has Address
Has Email
Has ID
Has Name
Has Phone
Has Populated place

Your file fields: ID, Name, phone, address, URL, Populated place, Email
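The matching rule could be checked up front before running step 3.5. The helper below is hypothetical (case_mismatches is not part of the Starter Kit); it flags input fields that differ from an allowed name only by letter case:

```python
# The words from URLs_words.txt; the allowed field names are derived by
# stripping the "Has " prefix. Matching is case-sensitive.
url_words = ['Has Address', 'Has Email', 'Has ID', 'Has Name',
             'Has Phone', 'Has Populated place']
allowed = {w.removeprefix('Has ') for w in url_words}

def case_mismatches(fields):
    """Map each input field that differs from an allowed name only by
    letter case to the spelling the word list expects."""
    lower_map = {a.lower(): a for a in allowed}
    return {f: lower_map[f.lower()] for f in fields
            if f not in allowed and f.lower() in lower_map}

# The fields from the reporter's input file: 'phone' and 'address' would
# need to be renamed 'Phone' and 'Address' to pass step 3.5.
fields = ['ID', 'Name', 'phone', 'address', 'URL', 'Populated place', 'Email']
print(case_mismatches(fields))  # {'phone': 'Phone', 'address': 'Address'}
```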

Sorry for the delayed answer. Please contact me if there are other problems.

Best regards, Kostadin
