RobertMyles / tidyRSS

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds
https://robertmyles.github.io/tidyRSS/

broken RSS: datatau.com #54

Closed bryanwhiting closed 3 years ago

bryanwhiting commented 3 years ago

Thanks for the awesome package, Robert!! I'm loving it.

I tried this feed:

http://www.datatau.com/rss

Got this error:

> df = tidyfeed('https://www.datatau.com/rss/')
GET request successful. Parsing...

Error in tidyfeed("https://www.datatau.com/rss/") : 
  Error in feed parse; please check URL.

  If you're certain that this is a valid rss feed,
  please file an issue at https://github.com/RobertMyles/tidyRSS/issues.
  Please note that the feed may also be undergoing maintenance.

Here's my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.7.9.2 aRxiv_0.5.19      kableExtra_1.3.1  forcats_0.5.0    
 [5] stringr_1.4.0     purrr_0.3.4       readr_1.4.0       tidyr_1.1.2      
 [9] tibble_3.0.4      ggplot2_3.3.2     tidyverse_1.3.0   dplyr_1.0.2      
[13] DT_0.16           tidyRSS_2.0.3    

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0   xfun_0.19          haven_2.3.1        colorspace_2.0-0  
 [5] vctrs_0.3.4        generics_0.1.0     viridisLite_0.3.0  htmltools_0.5.0   
 [9] yaml_2.2.1         rlang_0.4.8        pillar_1.4.6       withr_2.3.0       
[13] glue_1.4.2         DBI_1.1.0          dbplyr_2.0.0       modelr_0.1.8      
[17] readxl_1.3.1       lifecycle_0.2.0    munsell_0.5.0      anytime_0.3.9     
[21] gtable_0.3.0       cellranger_1.1.0   rvest_0.3.6        htmlwidgets_1.5.2 
[25] evaluate_0.14      knitr_1.30         curl_4.3           fansi_0.4.1       
[29] broom_0.7.2        Rcpp_1.0.5         renv_0.12.2        backports_1.2.0   
[33] scales_1.1.1       install.load_1.2.3 checkmate_2.0.0    webshot_0.5.2     
[37] jsonlite_1.7.1     fs_1.5.0           fastmatch_1.1-0    hms_0.5.3         
[41] digest_0.6.27      stringi_1.5.3      grid_4.0.3         cli_2.1.0         
[45] tools_4.0.3        magrittr_1.5       crayon_1.3.4       pkgconfig_2.0.3   
[49] ellipsis_0.3.1     xml2_1.3.2         reprex_0.3.0       assertthat_0.2.1  
[53] rmarkdown_2.5      httr_1.4.2         rstudioapi_0.13    R6_2.5.0          
[57] compiler_4.0.3    

RobertMyles commented 3 years ago

Hi Bryan, thanks for the issue. That's an unusual RSS feed: it just has a series of links instead of the content and structure you'd normally associate with RSS (i.e. here). I'd recommend scraping it directly. Even this little snippet of code will get you the titles and URLs pretty easily, which you can then parse further:

library(rvest)
#> Loading required package: xml2
library(stringr)

rss <- "http://www.datatau.com/rss"
read_html(rss) %>% 
    html_text() %>% 
    str_split("]]")
#> [[1]]
#>  [1] "DataTauhttp://www.datatau.com/Hacker News for Data Science5 tips for aspiring and junior data engineershttps://medium.com/analytics-and-data/5-tips-for-aspiring-and-junior-data-engineers-8b47ef154367http://www.datatau.com/item?id=30486Comments"                       
#>  [2] ">What To Do When You Can't AB Testhttps://towardsdatascience.com/what-to-do-when-you-cant-ab-test-4e1dff692bf7http://www.datatau.com/item?id=30467Comments"                                                                                                                
#>  [3] ">PyTorch vs. TensorFlow – a detailed comparisonhttps://www.tooploox.com/blog/pytorch-vs-tensorflow-a-detailed-comparisonhttp://www.datatau.com/item?id=30464Comments"                                                                                                      
#>  [4] ">The Personal Python Data Science Toolkithttps://www.alexfranz.com/posts/personal-python-data-science-toolkit-part-1/http://www.datatau.com/item?id=30459Comments"                                                                                                         
#>  [5] ">Why NYC is a Great Place to Break into AIhttps://blog.insightdatascience.com/why-nyc-is-a-great-place-to-break-into-ai-4acc97133391http://www.datatau.com/item?id=29230Comments"                                                                                          
#>  [6] ">Better Preference Predictions: Tunable and Explainable Recommender Systemshttps://blog.insightdatascience.com/tunable-and-explainable-recommender-systems-cd52b6287badhttp://www.datatau.com/item?id=29318Comments"                                                       
#>  [7] ">A Simple Guide to Semantic Segmentationhttps://medium.com/beyondminds/a-simple-guide-to-semantic-segmentation-effcf83e7e54?source=friends_link&sk=3d1a5a32a19d611fbd81028cfd4f23fdhttp://www.datatau.com/item?id=29312Comments"                                           
#>  [8] ">All AI and Data Science News in one Placehttps://allainews.com/http://www.datatau.com/item?id=29480Comments"                                                                                                                                                              
#>  [9] ">Intro to forecasting with FB's Prophet (python)https://www.interviewqs.com/ddi_code_snippets/prophet_intro_http://www.datatau.com/item?id=29272Comments"                                                                                                                  
#> [10] ">Complete Machine Learning Using Azure Machine Learning https://www.udemy.com/machine-learning-using-azureml/?couponCode=DATA090http://www.datatau.com/item?id=29896Comments"                                                                                              
#> [11] ">AutoML for Data Augmentationhttps://blog.insightdatascience.com/automl-for-data-augmentation-e87cf692c366http://www.datatau.com/item?id=29644Comments"                                                                                                                    
#> [12] ">How to Do A/B Testing: A Checklist You’ll Want to Bookmarkhttps://medium.com/@webdavidpage/how-to-run-a-b-testing-a-checklist-youll-want-to-bookmark-99c75aa9860bhttp://www.datatau.com/item?id=29543Comments"                                                            
#> [13] ">16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysishttps://github.com/Deffro/text-preprocessing-techniqueshttp://www.datatau.com/item?id=29410Comments"                                                                                             
#> [14] ">Using Transfer Learning for NLP with Small Datahttps://blog.insightdatascience.com/using-transfer-learning-for-nlp-with-small-data-71e10baf99a6http://www.datatau.com/item?id=30313Comments"                                                                              
#> [15] ">Overview of the different approaches to putting ML models in productionhttps://medium.com/analytics-and-data/overview-of-the-different-approaches-to-putting-machinelearning-ml-models-in-production-c699b34abf86http://www.datatau.com/item?id=30198Comments"            
#> [16] ">Train models and run notebooks on AWS cheaper and simpler than with SageMakerhttps://medium.com/apls/how-to-train-deep-learning-models-on-aws-spot-instances-using-spotty-8d9e0543d365http://www.datatau.com/item?id=30012Comments"                                       
#> [17] ">Using Reinforcement Learning to Design a Better Rocket Enginehttps://blog.insightdatascience.com/using-reinforcement-learning-to-design-a-better-rocket-engine-4dfd1770497ahttp://www.datatau.com/item?id=29857Comments"                                                  
#> [18] ">Amex Data Science Interview Questionshttps://medium.com/acing-ai/amex-data-science-interview-questions-a8d2634c647http://www.datatau.com/item?id=30211Comments"                                                                                                           
#> [19] ">Square Data Science Interview Questionshttps://medium.com/acing-ai/square-data-science-interview-questions-daa67cfe96c9http://www.datatau.com/item?id=30100Comments"                                                                                                      
#> [20] ">The Job Board for Data Scientists and Machine Learners Onlyhttps://ai-jobs.net/#s=1http://www.datatau.com/item?id=29971Comments"                                                                                                                                          
#> [21] ">Is analytics a luxury only to Giants in the Financial Services?https://medium.com/@apurva_39772/zepto-ai-powered-data-analytics-tool-for-financial-services-a01dabe610c7http://www.datatau.com/item?id=29758Comments"                                                     
#> [22] ">A visual exploration of Gaussian processeshttps://distill.pub/2019/visual-exploration-gaussian-processeshttp://www.datatau.com/item?id=29746Comments"                                                                                                                     
#> [23] ">5 domains of ecommerce Data Strategyhttps://medium.com/analytics-and-data/5-domains-of-ecommerce-data-strategy-82b61356042chttp://www.datatau.com/item?id=29603Comments"                                                                                                  
#> [24] ">Become a Pro at Pandas, Python’s data manipulation Libraryhttps://medium.com/analytics-and-data/become-a-pro-at-pandas-pythons-data-manipulation-library-264351b586b1?source=friends_link&sk=cfcd8713cbdae2e48277acf8084c5e13http://www.datatau.com/item?id=30427Comments"
#> [25] ">Three essential skills you'll need as a data scientisthttps://peterscobas.com/2019/04/29/three-essential-skills-youll-need-as-a-data-scientist/http://www.datatau.com/item?id=30426Comments"                                                                              
#> [26] ">ERUPT: Expected Response Under Proposed Treatmentshttps://medium.com/building-ibotta/erupt-expected-response-under-proposed-treatments-ff7dd45c84b4http://www.datatau.com/item?id=30397Comments"                                                                          
#> [27] ">Python - Hadoop interaction tutorial (PySpark, PyArrow, impyla, etc.)https://thegurus.tech/posts/2019/05/hadoop-python/http://www.datatau.com/item?id=30355Comments"                                                                                                      
#> [28] ">Automate your Flask Deployments on AWShttps://blog.insightdatascience.com/automate-your-flask-deployments-on-aws-db4d8e2345ahttp://www.datatau.com/item?id=30337Comments"                                                                                                 
#> [29] ">Citibank Data Science Interview Questionshttps://medium.com/acing-ai/citibank-data-science-interview-questions-1ac5c71ff29http://www.datatau.com/item?id=30310Comments"                                                                                                   
#> [30] ">How I became a data scientisthttps://www.peterscobas.com/2019/04/26/how-i-became-a-data-scientist/http://www.datatau.com/item?id=30297Comments"                                                                                                                           
#> [31] ">"

Created on 2020-11-14 by the reprex package (v0.3.0)

Hope that works for you, as this feed is too far removed from 'normal' RSS for me to fit it into the package.

Rob

bryanwhiting commented 3 years ago

Thanks for the reply! I appreciate the pointer and was able to finish the rest:

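# Assumes `rss` is defined as in the snippet above, with rvest and stringr loaded
library(dplyr)  # select(), rename(), as_tibble()
library(tidyr)  # drop_na()

datatau <- read_html(rss) %>%
  html_text() %>%
  str_split("]]") %>%
  .[[1]] %>%                                           # str_split() returns a list; take its only element
  str_match(">(.*)(https://.*)(http[s]?://.*)") %>%    # columns: full match, title, article link, DataTau item link
  as.data.frame() %>%
  select(V2, V3) %>%
  rename(item_title = V2, item_link = V3) %>%
  drop_na() %>%                                        # drop chunks the pattern did not match
  as_tibble()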

This returns:

# A tibble: 29 x 2
   item_title                                     item_link                                                    
   <chr>                                          <chr>                                                        
 1 "5 tips for aspiring and junior data engineer… https://medium.com/analytics-and-data/5-tips-for-aspiring-an…
 2 "What To Do When You Can't AB Test"            https://towardsdatascience.com/what-to-do-when-you-cant-ab-t…
 3 "PyTorch vs. TensorFlow – a detailed comparis… https://www.tooploox.com/blog/pytorch-vs-tensorflow-a-detail…
 4 "The Personal Python Data Science Toolkit"     https://www.alexfranz.com/posts/personal-python-data-science…
 5 "Why NYC is a Great Place to Break into AI"    https://blog.insightdatascience.com/why-nyc-is-a-great-place…
 6 "Better Preference Predictions: Tunable and E… https://blog.insightdatascience.com/tunable-and-explainable-…
 7 "A Simple Guide to Semantic Segmentation"      https://medium.com/beyondminds/a-simple-guide-to-semantic-se…
 8 "All AI and Data Science News in one Place"    https://allainews.com/                                       
 9 "Intro to forecasting with FB's Prophet (pyth… https://www.interviewqs.com/ddi_code_snippets/prophet_intro_ 
10 "Complete Machine Learning Using Azure Machin… https://www.udemy.com/machine-learning-using-azureml/?coupon…
# … with 19 more rows
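
For anyone adapting this, here is a minimal offline sketch of the same idea in base R, so it runs without extra packages or a network call. It is not the code used above: `parse_chunks()` is a hypothetical helper, the two sample chunks are copied from the `str_split()` output earlier in this thread, and the pattern assumes every item chunk starts with `>` and ends with a `http://www.datatau.com/item?id=...` link followed by `Comments`. Chunks that do not fit that shape (such as the trailing `">"`) are skipped.

```r
# Sample chunks copied from the str_split() output earlier in the thread
chunks <- c(
  ">What To Do When You Can't AB Testhttps://towardsdatascience.com/what-to-do-when-you-cant-ab-test-4e1dff692bf7http://www.datatau.com/item?id=30467Comments",
  ">All AI and Data Science News in one Placehttps://allainews.com/http://www.datatau.com/item?id=29480Comments",
  ">"
)

# Hypothetical helper: split each chunk into title, article link and
# DataTau item link. The lazy groups (.*?) stop at the first place where
# the rest of the pattern can still match.
parse_chunks <- function(x) {
  pattern <- "^>(.*?)(https?://.*?)(http://www\\.datatau\\.com/item\\?id=[0-9]+)Comments$"
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  hits <- hits[lengths(hits) == 4]  # keep only chunks where the pattern matched
  data.frame(
    item_title    = vapply(hits, `[`, character(1), 2),
    item_link     = vapply(hits, `[`, character(1), 3),
    item_comments = vapply(hits, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_chunks(chunks)
print(parsed)
```

Anchoring the last group on the DataTau item URL is a design choice made for this sketch: it keeps the article link from swallowing the item link, and it also allows article links that start with `http://` rather than `https://`. If the feed's layout changes, the pattern would need adjusting.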

RobertMyles commented 3 years ago

Great :-)