Roleren / ORFik

MIT License
32 stars 9 forks source link

download.SRA.metadata throws an error #129

Closed yeroslaviz closed 1 year ago

yeroslaviz commented 1 year ago

When trying to download the metadata from an bioproject, I get this error:

 > download.SRA.metadata("PRJNA591214")
trying URL 'https://trace.ncbi.nlm.nih.gov/Traces/sra-db-be/runinfo?term=PRJNA591214'
downloaded 6316 bytes

Could not find abstract for project
Empty data.table (0 rows and 45 cols): Run,spots,bases,spots_with_mates,avgLength,size_MB...

I have tried different values:

study <- download.SRA.metadata(SRP = "PRJNA472972", outdir = "./")
study <- download.SRA.metadata(SRP = "GSE114882", outdir = "./")
study <- download.SRA.metadata(SRP = "SRP148891", outdir = "./")

only SRP148891 works correctly.

I have installed the development version using devtools.

any ideas, if I need to change something?

Assa

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: /fs/gpfs41/lv07/fileset07/home/b_cox/yeroslaviz/miniconda3/envs/R/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Scerevisiae.UCSC.sacCer3_1.4.0
 [2] BSgenome_1.66.2                        
 [3] rtracklayer_1.58.0                     
 [4] data.table_1.14.6                      
 [5] GenomicFeatures_1.50.3                 
 [6] AnnotationDbi_1.60.0                   
 [7] ORFik_1.19.1                           
 [8] GenomicAlignments_1.34.0               
 [9] Rsamtools_2.14.0                       
[10] Biostrings_2.66.0                      
[11] XVector_0.38.0                         
[12] SummarizedExperiment_1.28.0            
[13] Biobase_2.58.0                         
[14] MatrixGenerics_1.10.0                  
[15] matrixStats_0.63.0                     
[16] GenomicRanges_1.50.2                   
[17] GenomeInfoDb_1.34.6                    
[18] IRanges_2.32.0                         
[19] S4Vectors_0.36.1                       
[20] BiocGenerics_0.44.0                    

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3       rjson_0.2.21           ellipsis_0.3.2        
  [4] rprojroot_2.0.3        fs_1.5.2               remotes_2.4.2         
  [7] bit64_4.0.5            fansi_1.0.3            xml2_1.3.3            
  ...
Roleren commented 1 year ago

Hm, looks like NCBI have broken their API to fetch sample data, will do a check and fix it

Roleren commented 1 year ago

Fixed, run devtools::install_github("Roleren/ORFik")

I had to dig into the new API, looks like the Trace DB is being deprecated, well well. Fixed now at least.

yeroslaviz commented 1 year ago

thanks,

now it works with all options (PRJ, GSE, SRP).

yeroslaviz commented 1 year ago

I'm not sure if I need to open a new issue for that, as this is just a question.

If I already have the fastq file, and I don't want to download them again ( or if they are my own fastq files without a metadata file from GSE - how can I start an experiment?

Do I just need to create the conf object with my needed paths and put the fastq files in the raw_data folder? Can I also create the metadata file manually? Do I need it later on for the analysis or is it mainly for the naming of tha samepls.

yeroslaviz commented 1 year ago

Another error I get now, not sure if it is again related to changes in the links from ensembl, is when I try to download the annotation using this command:

getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel")

I run into this error:

Starting toplevel genome retrieval of Saccharomyces_cerevisiae from ensembl: 
Found organism but given release number did not specify existing file                                                             
                     in ensembl, maybe it is too old? Check that it exists on ensembl
                     first at all.
Remember some small genome organisms like yeast, does not have primary assemblies, then change assembly_type to toplevel and/or usedb = refseq.
Error in file.exists(filename) : invalid 'file' argument

This also happens when trying to download 'Danio rerio' or even 'homo spaiens'.

Roleren commented 1 year ago

First of, the ORFik experiment config is for much more than file naming, among other linking libraries to annotation, organism, automatic loading of files, etc But it is not required by ORFik, but makes things a lot simpler down the road.

If you have a folder with custom files, just run create.experiment on that folder, and add sample information, organism etc. You get a csv file you can edit if needed.

Then run the function list.experiments when you are done to see which experiments you have made.

Roleren commented 1 year ago

Secondly, about getGenomeAndAnnotation, this is most likely a curl bug upstream of ORFik. Maybe in the biomartr package, I will check if I can push a fix for that.

Roleren commented 1 year ago

Ah, yes, there was a bug I made a Pull request to biomartr a while back, it has still not propagated to CRAN, so you have to do: For me it works, since I use devel of biomartr (Nothing I can do for now, until it will push to CRAN in a few weeks)

For it to work for you for now, do:

devtools::install_github("ropensci/biomartr")

Then you get the version with the patch I made.

yeroslaviz commented 1 year ago

If you have a folder with custom files, just run create.experiment on that folder, and add sample information, organism etc. You get a csv file you can edit if needed.

I would really appreciate it, if you can send me a template file with the necessary columns, so that i can input my own data into for the analysis.

can I run the create.experiment on a folder with fastq files or should the files be already mapped (bam files)? I guess I will than need to first align them using STAR.align.folder.

Roleren commented 1 year ago

Check how I did it in the Ribo-seq vignette. First map reads, the run create.experiment on the bam folder and insert valid types.

My guess is for you that libtype = "RFP" (that is riboseq), and condition = "WT" and some experimental condition ? And set rep = c(1,2,3,1,2,3) if you have a 2 by 3 design etc.

ORFik tries to auto-guess the types, so if you have names like Ribo_WT_rep1, then it will just figure it out. Let me know what it outputs, then I can help you if you think it is wrong :)

yeroslaviz commented 1 year ago

I would appreciate the help, as it seems that the mapping was really bad. I have managed to map only a very small fraction of the reads to the S. cerevisiae (< 3%). It is the correct organism, as I am using a public data set (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114882).

For testing reasons, I am using only four samples for the run to be done quicker. These are steps I have done so far (If you like, I can also share the Rmd File i have for this analysis with you):

  1. created the conf object
  2. downloaded the genomic annotations
  3. indexed the genome
  4. Mapping

It seems that mapping to the rRNA is not an issue, as less than 10% of the reads were mapped to the contaminant runs. one factor might be the fact that when using the auto option in the aligner step, fastp doesn't recognise the presence of the adapter and therefore doesn't trim anything. I will change the setting to the adapter sequence ( standard TruSeq sequence - AGATCGGAAGAGC). Or is this also a versioning thing? (I'm using 0.23.2).

Roleren commented 1 year ago

The fastp version should not matter I think, I think my warning system of failed adapter detection is a bit weak, but when you know it, it works well. So what you do is set it to the TRuSeq adapter sequences, yes.

Then you mapping rate should increase considerably.

The reason this happens is this: 1.The adapters are not detected, therefor not trimmed 2.STAR in local alignment (the default) requires 50% of the read to map, but this will fail a lot for you, since adapters are often ~ 50% of the read. 3.Your result will be a very poor mapping rate.

yeroslaviz commented 1 year ago

well, unfortunately not. I have tested two things now.

  1. as said above, I have added the adapter seq to the command.
  2. I have used cutadapt to trim the rads using the same adapter seq and then run the same command, but only with the steps = "co-ge" parameter.

While the run with cutadapt mapped still around 3%, the one with fastp and the adapter sequence given have increased the results to only around 7%.

So there must be something other than that wrong in the data. I am trying to see if the data contains any contamination, or maybe even a different organism.

yeroslaviz commented 1 year ago

So, as you can see in the image below, mapping against S. cerevisiae is correct. But I guess, the reason for the low mapping results is the very high number of duplications.

The question is, does it make sense to increase the number of allowed multi-mapped to increase the mapping results?

GSM3152885_screen 2

Roleren commented 1 year ago

There the alignment is 50%, so that is great actually.

Where did you get the 3% and 7% values earlier ? Did you collapse reads before you aligned ?

yeroslaviz commented 1 year ago

Yes, this is what makes me wonder as well.
The image above was made with fastq-screen (do you know it?). The mapping was done on the raw fastq files with no trimming or filtering and the mapping is done with bowtie2.

The 3% and 7% I got from the normal runs, the first with the adapter set to "auto" and the second when the adapter sequence was given in full.

Increasing the multi-mapped reads to 30 didn't increase the results significantly as expected.

Not really sure what to do next.

yeroslaviz commented 1 year ago

Just a thought. Would it maybe make sense to try and map the samples against the transcriptome and not genome?

Roleren commented 1 year ago

Hm, I think something is off in the call to STAR, some argument that should not be there is included. Bowtie2 should never outperform STAR on spliced RNA data (like Ribo-seq). Could you send me the console output when you run ? (just rerun and abort after 2 sec, when all init console output is there, so I can verify all arguments)

yeroslaviz commented 1 year ago

Here is the command I'm using to execute the mapping

STAR.align.folder(input.dir = conf["fastq Ribo-seq"], output.dir = paste0(conf["bam Ribo-seq"],"AGATCGGAAGAGC_trim_multiMapped100/"), index.dir = index,
                  star.path = "/fs/home/yeroslaviz/miniconda3/envs/Mapping/bin/STAR",
                  fastp     = "/fs/home/yeroslaviz/miniconda3/envs/Mapping/bin/fastp", 
                  paired.end = FALSE,
                  steps = "tr-co-ge", # (trim needed: adapters found, then genome)
                  keep.contaminants = FALSE, 
                  max.cpus = 15, # number of threads to use
                  adapter.sequence = "AGATCGGAAGAGC", # Adapters are auto detected
                  trim.front = 0, min.length = 20,
                  max.multimap = 100)

the complete output of this command is also attached here. STAR.align.folder_Output.txt

yeroslaviz commented 1 year ago

I played with the STAR parameters externally to see if I can increase the #mapped reads and I have (for now) ended up with this command

STAR --runThreadN 12 --genomeDir STARindex --readFilesIn fastq.gz 
--outFileNamePrefix file. --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 7062663806 
--winAnchorMultimapNmax 400 --outFilterMultimapNmax  200  
--seedSearchStartLmax  10 --outFilterMismatchNoverLmax 0.05 
--outFilterMatchNmin 0 --outFilterScoreMinOverLread 0 
--outFilterMatchNminOverLread 0 --alignIntronMax 1

The command takes as input the fastq file after removing the adapter with cutadapt and filtering out reads which mapped to rRNA (this mapping is done using bowtie).

the results of the mapping is this:

                                 Started job on |   Jan 20 10:28:35
                             Started mapping on |   Jan 20 10:28:37
                                    Finished on |   Jan 20 10:45:59
       Mapping speed, Million of reads per hour |   94.67

                          Number of input reads |   27402469
                      Average input read length |   40
                                    UNIQUE READS:
                   Uniquely mapped reads number |   12520204
                        Uniquely mapped reads % |   45.69%
                          Average mapped length |   16.40
                       Number of splices: Total |   0
            Number of splices: Annotated (sjdb) |   0
                       Number of splices: GT/AG |   0
                       Number of splices: GC/AG |   0
                       Number of splices: AT/AC |   0
               Number of splices: Non-canonical |   0
                      Mismatch rate per base, % |   0.18%
                         Deletion rate per base |   0.00%
                        Deletion average length |   1.00
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.21
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   14873457
             % of reads mapped to multiple loci |   54.28%
        Number of reads mapped to too many loci |   0
             % of reads mapped to too many loci |   0.00%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   8797
       % of reads unmapped: too many mismatches |   0.03%
            Number of reads unmapped: too short |   0
                 % of reads unmapped: too short |   0.00%
                Number of reads unmapped: other |   11
                     % of reads unmapped: other |   0.00%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

This mapping was done against a specifically made transcriptome for the RiboSeq analysis with bases upstream of the TSS added to each of the sequences. so It cannot be used automatically with ORFik.

How can I add these parameters to the ORFik script to be run within the package?

thanks

Roleren commented 1 year ago

Could you rerun the same input file after rRNA removal with ORFik default Star version 2.7.4a. and the internal orfik script (not your version) ? Need to know if it is the version of star or arguments where something is off.

Roleren commented 1 year ago

STAR only takes an index during mapping, give the transciptome index instead and it should work in ORFik.

Roleren commented 1 year ago

Transciptome mapping should never align better than genome mapping, so something is going on here, something strange..

yeroslaviz commented 1 year ago

well, this is embarrassing.

I have solved the problem, nothing technical or complicated. The biologist gave me the wrong adapter. We are working on two data sets in parallel one for S. cerevisiae and one for C. elegans. Apparently I have used the wrong adapter here. Instead of AGATCGGAAGAGC I needed to use CTGTAGGCACCATCAAT.

Now I get these results:

Final statistics:
       sample sample_id total mapped reads %-genome total mapped reads #-genome ...
1: GSM3152885         1                       80.98                    10820008 ...
2: SRR7214394         2                       81.98                     3812062 ...
3: SRR7214409         3                       85.82                    11535207 ...
4: SRR7214411         4                       85.70                    12337176 ...

I'm truly sorry for wasting your time for such trivialities. At least i have learned a lot in the process about your tool as well as how STAR works nicely for very short reads.

I'm closing this issue for now, but I am sure I'll contact you again downstream in the analysis.

thanks for all the help

Assa

Roleren commented 1 year ago

Haha, you are not the first to do this, I have done this myself before.

The really cool thing is if you had used massiveNGSpipe (the package we will make public soon), it catches this by default. Since it can even find correct adapter in a multiadapter environment. Will let you know when we release it in a few days.

We might even include a shiny GUI so you can keep track of progress of processing (we could in theory connect it to discord etc, so you could get a status plot or a ping when something is done, a lot of cool possibilities).

yeroslaviz commented 1 year ago

This sounds cool. Would love to test it. let me know