broken RSS: #54

bryanwhiting closed 3 years ago

bryanwhiting commented 3 years ago

Thanks for the awesome package, Robert!! I'm loving it.

I tried this feed:

Got this error:

> df = tidyfeed('')
GET request successful. Parsing...

Error in tidyfeed("") : 
  Error in feed parse; please check URL.

  If you're certain that this is a valid rss feed,
  please file an issue at
  Please note that the feed may also be undergoing maintenance.

here's my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

RobertMyles commented 3 years ago

Hi Bryan, thanks for the issue. That's an unusual RSS feed: it is just a series of links rather than the content and structure you'd normally associate with RSS (e.g. here). I'd recommend scraping it directly. Even this little snippet of code will get you the titles and URLs pretty easily, which you can then parse further:

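library(rvest)
#> Loading required package: xml2
library(stringr)

rss <- ""
read_html(rss) %>% 
    html_text() %>% 
    # split the page text on the literal "]]" that ends each entry
    str_split("]]")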
#> [[1]]
#>  [1] "DataTau News for Data Science5 tips for aspiring and junior data engineers"                       
#>  [2] ">What To Do When You Can't AB Test"                                                                                                                
#>  [3] ">PyTorch vs. TensorFlow – a detailed comparison"                                                                                                      
#>  [4] ">The Personal Python Data Science Toolkit"                                                                                                         
#>  [5] ">Why NYC is a Great Place to Break into AI"                                                                                          
#>  [6] ">Better Preference Predictions: Tunable and Explainable Recommender Systems"                                                       
#>  [7] ">A Simple Guide to Semantic Segmentation"                                           
#>  [8] ">All AI and Data Science News in one Place"                                                                                                                                                              
#>  [9] ">Intro to forecasting with FB's Prophet (python)"                                                                                                                  
#> [10] ">Complete Machine Learning Using Azure Machine Learning"                                                                                              
#> [11] ">AutoML for Data Augmentation"                                                                                                                    
#> [12] ">How to Do A/B Testing: A Checklist You’ll Want to Bookmark"                                                            
#> [13] ">16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysis"                                                                                             
#> [14] ">Using Transfer Learning for NLP with Small Data"                                                                              
#> [15] ">Overview of the different approaches to putting ML models in production"            
#> [16] ">Train models and run notebooks on AWS cheaper and simpler than with SageMaker"                                       
#> [17] ">Using Reinforcement Learning to Design a Better Rocket Engine"                                                  
#> [18] ">Amex Data Science Interview Questions"                                                                                                           
#> [19] ">Square Data Science Interview Questions"                                                                                                      
#> [20] ">The Job Board for Data Scientists and Machine Learners Only"                                                                                                                                          
#> [21] ">Is analytics a luxury only to Giants in the Financial Services?"                                                     
#> [22] ">A visual exploration of Gaussian processes"                                                                                                                     
#> [23] ">5 domains of ecommerce Data Strategy"                                                                                                  
#> [24] ">Become a Pro at Pandas, Python’s data manipulation Library"
#> [25] ">Three essential skills you'll need as a data scientist"                                                                              
#> [26] ">ERUPT: Expected Response Under Proposed Treatments"                                                                          
#> [27] ">Python - Hadoop interaction tutorial (PySpark, PyArrow, impyla, etc.)"                                                                                                      
#> [28] ">Automate your Flask Deployments on AWS"                                                                                                 
#> [29] ">Citibank Data Science Interview Questions"                                                                                                   
#> [30] ">How I became a data scientist"                                                                                                                           
#> [31] ">"

Created on 2020-11-14 by the reprex package (v0.3.0)

Hope that works for you, as this feed is too far removed from 'normal' RSS for me to fit it into the package.
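
For anyone who wants to try the splitting step without hitting the feed, here is a minimal offline sketch in base R. The `feed_text` string and its example.com URLs are made up for illustration and are not the real feed content; the only assumption carried over from the output above is that entries are separated by a literal `]]`.

```r
# Hypothetical stand-in for the text returned by html_text(); not the real feed.
feed_text <- paste0(
  "First title]]",
  ">Second titlehttps://example.com/a]]",
  ">Third titlehttps://example.com/b]]",
  ">"
)

# Split on the literal "]]"; fixed = TRUE avoids treating "]" as regex syntax.
pieces <- strsplit(feed_text, "]]", fixed = TRUE)[[1]]
print(pieces)
```

Every piece after the first starts with `>`, and the final piece is a bare `>`, which is the same shape as the last element in the output above.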


bryanwhiting commented 3 years ago

Thanks for the reply! I appreciate the pointer and was able to finish the rest:

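library(rvest)
library(tidyverse)

datatau <- read_html(rss) %>% 
  html_text() %>% 
  str_split("]]") %>%
  .[[1]] %>%
  # str_match() returns a matrix: column 1 is the full match, then one column per capture group
  str_match(">(.*)(https://.*)(http[s]?://.*)") %>%
  # as_tibble() names the unnamed matrix columns V1, V2, ... (with a warning)
  as_tibble() %>%
  select(V2, V3) %>%
  rename(item_title = V2, item_link = V3) %>%
  drop_na()

datatau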


# A tibble: 29 x 2
   item_title                                     item_link                                                    
   <chr>                                          <chr>                                                        
 1 "5 tips for aspiring and junior data engineer……
 2 "What To Do When You Can't AB Test"  …
 3 "PyTorch vs. TensorFlow – a detailed comparis……
 4 "The Personal Python Data Science Toolkit"…
 5 "Why NYC is a Great Place to Break into AI"…
 6 "Better Preference Predictions: Tunable and E……
 7 "A Simple Guide to Semantic Segmentation"…
 8 "All AI and Data Science News in one Place"                                       
 9 "Intro to forecasting with FB's Prophet (pyth… 
10 "Complete Machine Learning Using Azure Machin……
# … with 19 more rows
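
The title/link extraction can also be sketched offline in base R. This is only an illustration under stated assumptions: the `pieces` vector and its example.com URLs are hypothetical, and the simplified two-group `pattern` is not the regex used in the pipeline above.

```r
# Hypothetical pieces shaped like the split output earlier in the thread;
# the URLs are placeholders, not the real links.
pieces <- c(
  ">What To Do When You Can't AB Testhttps://example.com/ab-test",
  ">A Simple Guide to Semantic Segmentationhttp://example.com/segmentation",
  ">"
)

# Group 1: the title (lazy match). Group 2: the link that follows it.
pattern <- "^>(.*?)(https?://.*)$"
matches <- regmatches(pieces, regexec(pattern, pieces, perl = TRUE))

# Pieces that do not match (such as the bare ">") come back with length 0.
matched <- Filter(function(m) length(m) == 3, matches)

items <- data.frame(
  item_title = vapply(matched, `[`, character(1), 2),
  item_link  = vapply(matched, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(items)
```

Filtering out the non-matching pieces plays the same role here as `drop_na()` does in the tidyverse version.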

RobertMyles commented 3 years ago

Great :-)