Input file error when running crossmapper tool on RNAseq reads

Justin1609 commented 3 years ago

Hi there, I am very new to bioinformatics and am trying to analyze RNAseq reads of three similar species of yeast using Crossmapper. I would like to see if this software can give an indication of whether or not these yeasts can in fact be analyzed together in one sample without too many reads mapping to the incorrect genomes. I came across Crossmapper whilst trying to solve this issue. However, I am getting the following error when I try to run my own data in a similar way to what is described in the tutorial examples.

"Error: EXAMPLE_INPUT.fa file does not exist! Please provide a valid file."

I followed Example 2 of the Crossmapper tutorial and input my own read files, which were in fastq format, so I converted them to fasta format using the seqtk tool. I also used the gtf files of the relevant yeast species I am studying.

I apologize if this is the incorrect way to ask a question, as I said I am completely new to this field and not aware of best practices concerning asking for assistance. If anyone does have some advice for what could be causing the issue I would greatly appreciate it.

Lastly I have installed anaconda and installed the cross mapper software correctly. In addition, I am running these analyses on a VM using a linux OS.

GrantHov commented 3 years ago

Hi,

The error indicates that the input file does not exist or that its fasta format is not valid. Please make sure that you point Crossmapper to a valid fasta file. For simplicity, try putting all the input files in one directory and running Crossmapper from that directory. If that still doesn't help, you can send me the fasta files and I can see what is going on.

Cheers Hrant
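
A quick way to rule out the two causes named above (file not found, or not valid fasta) is sketched below. This is not part of Crossmapper: the function name check_fasta is made up for illustration, and it only checks that the file is non-empty and that its first non-blank line starts with ">".

```shell
# Hypothetical helper (not part of Crossmapper): basic sanity check of a fasta file.
check_fasta() {
    f=$1
    # -s is true only if the file exists and is not empty
    if [ ! -s "$f" ]; then
        echo "missing or empty file: $f" >&2
        return 1
    fi
    # First character of the first non-blank line; fasta headers start with ">"
    first=$(awk 'NF { print substr($0, 1, 1); exit }' "$f")
    if [ "$first" != ">" ]; then
        echo "does not look like fasta: $f" >&2
        return 1
    fi
    echo "ok: $f contains $(grep -c '^>' "$f") sequence header(s)"
}
```

For example, run check_fasta INPUT.fa from the directory that holds the input files. A file that passes can still be unsuitable for other reasons, so this only rules out the simplest causes.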

Justin1609 commented 3 years ago

Hi Grant

Thanks, I have now tried another tool to convert my fastq files to fasta format and it seems to be working, as it has accepted the input files. If I have any more issues I will ask again. Again, thank you very much for the quick response, I truly appreciate it.

Kind regards

Justin

GrantHov commented 3 years ago

You are welcome, and I hope Crossmapper can be useful for your work. Btw, if reference genomes of those yeasts exist, I'd recommend running Crossmapper with the reference genomes rather than creating and using your own fasta files. This way Crossmapper will simulate reads from the reference genomes, map the reads back to the references, and report the crossmapping rates.

Justin1609 commented 3 years ago

Hi Grant,

I seem to be getting a new error now indicating that the cat CMD cannot execute and then it reports that the execution failed. I have attached the output error files. cat_stderr.txt crossmap.log

I am not sure what the issue is, as the "-o" option only allows you to specify the name of the output folder, but it seems almost as if the program needs the INPUT.fa file to have an output file specified with the same name, just edited slightly, to write the data to, for example INPUT_out.fa. Below is an example of the command I ran:

crossmapper RNA -g INPUT.fa -a REF_YEAST_GENOME.gtf -rlay SE -rlen 75,100 -N 30000000 -r 0.03 -gb -t 10

And thank you for the tip, I truly appreciate it. So are you saying instead of running my reads I generated from RNAseq analysis, I could rather just simply take the genomes of the three yeast species I am working with and input them into crossmapper? Will the program then simulate its own reads and tell me if cross mapping will be a major problem in analyzing the RNAseq reads of these 3 yeast species together?

Ultimately, I want to map the reads to a chimeric reference genome, which consists of the 3 genomes of the yeast species concatenated together. This is because I have samples containing all the yeasts mixed together, which I would like to analyze together to determine what this interaction between species does to gene expression.

If you have any thoughts regarding the issue with the program error, or the purpose of what I would like to use crossmapper for I would greatly appreciate your input.

Kind regards

J

GrantHov commented 3 years ago

From the logs you sent, it seems that the software simply does not find the input files that you supply. Are you sure you are specifying the correct path to them? In any case, I highly recommend running Crossmapper as I suggested in my previous reply. Download the reference genomes and genome annotations of those species (the fasta and gff files should be from the same source so that the chromosome names coincide, otherwise you will get problems downstream), put everything in one directory if you are not sure how to specify paths to them, and then run Crossmapper from that directory (following, say, Example 2 or 3).

The way you are trying to do it now has many issues: you would basically simulate reads from reads and then map those reads to the entire set of reads, which does not make sense. In addition, the gff files are not going to match any chromosome, because your fasta file does not contain any.

The main aim of Crossmapper is to let people look into the problem of crossmapping before doing an actual sequencing experiment. If you already have a sample which contains three species, you can proceed with mapping these data to the concatenated reference genomes and try performing the analysis you want. Of course, you can also run Crossmapper, which at this point will just tell you whether or not crossmapping is an issue; if not, then in theory you can safely trust your results. If it turns out that crossmapping is an issue and you have a lot of crossmapped reads, one possible way to alleviate it is to play around with the parameters of the mapping software and see how those parameters influence (possibly reduce) the crossmapping, although this could be a complicated analysis to do.
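
The point about chromosome names coinciding between the fasta and gff files can be checked before running anything. The sketch below is an illustration only, not something Crossmapper provides: the function name missing_in_fasta is hypothetical, and it assumes a tab-separated gff/gtf file with the sequence name in the first column and fasta headers whose ID is the text up to the first blank.

```shell
# Hypothetical helper: list sequence names used in a gff/gtf file that are
# absent from the fasta file. Empty output means every name was found.
missing_in_fasta() {
    awk -F '\t' '
        # First file (fasta): remember each sequence ID (header text up to the first blank)
        NR == FNR {
            if (/^>/) { id = substr($0, 2); sub(/[ \t].*/, "", id); seen[id] = 1 }
            next
        }
        # Second file (gff/gtf): skip comment lines and lines without 9 columns
        /^#/ || NF < 9 { next }
        # Report each unknown sequence name once
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

For example, missing_in_fasta Yeast_A.fasta Yeast_A.gff prints nothing when the names agree. Any name it prints is used by the annotation but missing from the genome file, which usually means the two files come from different sources.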

Justin1609 commented 3 years ago

Hi Grant,

Okay, thank you very much. I will use the fasta and gff files of the yeast genomes that I downloaded from the NCBI database. I have already generated the data, as you mentioned, and the exact point you make at the end is why I was hoping to use Crossmapper: as a means to validate that all the analyses I have done thus far are alright. I was hoping to use the Crossmapper results as an argument for why my analyses should be fine, if they show that there is not a high degree (or percentage) of cross mapping between the different species. I was also hoping to avoid the complicated analysis you mentioned for trying to increase the mapping specificity because, as I said, I am not very proficient in this field yet. I will try your suggestion now, though: if it indicates that cross mapping is an issue between those species even with that data, then I know I will somehow have to redo the analyses I have performed thus far. I truly appreciate all of your help.

Kind regards

J

ahmedihafez commented 3 years ago

Hi Justin and Hrant, as Hrant indicated, the error seems to be that the tool cannot find some of the input files. I think this is caused by white space in your path, which contains a folder named "RNAseq data analysis". For the moment, try renaming all folders in your path so that none of them contains white space; we will fix the tool to accept paths with spaces in a future update. Sorry for the inconvenience.

Thanks,
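
To check for this before a run, something like the following could be used. It is only a sketch: path_has_whitespace is a made-up name, and the check assumes the input files live in the current working directory.

```shell
# Hypothetical helper: succeeds (exit status 0) if the given path contains white space.
path_has_whitespace() {
    case $1 in
        *[[:space:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the current working directory, where the input files are assumed to live.
if path_has_whitespace "$PWD"; then
    echo "white space in path: $PWD"
else
    echo "no white space in path: $PWD"
fi
```

The same function can be called on the output folder or on each input file path if they are not in the current directory.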

Justin1609 commented 3 years ago

Hi Ahmed and Grant,

I seem to still not be able to run the tutorial example using data obtained off of NCBI database for the different yeast species I am studying. I am getting an issue which has been mentioned in the troubleshooting link regarding the wgsim program. I have attached the error files. cmd_wgsim_stderr.txt cmd_wgsim_stdout.txt cat_stderr.txt error.txt

I am performing these analyses in a VM that is using Linux OS. Could this be an issue at all? I am just not sure at this point what the issue could be? I am running crossmapper from the same directory as where the input files are located and I have run the command to fix the wgsim issue as indicated in the troubleshooting section, however, it still doesn't seem to be working. Any advice you could offer me would be greatly appreciated.

Kind regards

Justin

ahmedihafez commented 3 years ago

Hi Justin, it looks like the first problem with white space is gone and this is another problem, a missing library, similar to the one in the troubleshooting section. Can you let us know which version of Linux you are using? If you do not mind, can you also run the following two commands and send us the bio-env.txt and bio-env.yaml files they generate? They will contain information about all the dependencies installed in your conda env, which will help us resolve the issue.

conda list --explicit > bio-env.txt
## and this one too
conda env export > bio-env.yaml

Many thanks for your help.

Justin1609 commented 3 years ago

Hi Ahmed

I am running Ubuntu version 20.04.2 LTS on VirtualBox by Oracle. I have attached the two files you requested. I had to zip the YAML file, as you apparently cannot upload that file type. Any help you can offer would be greatly appreciated. I look forward to hearing from you soon.

Kind regards

Justin

bio-env.txt bio-env.yaml.gz

ahmedihafez commented 3 years ago

Many thanks @Justin1609 , We will look into the files and will get back to you as soon as possible.

ahmedihafez commented 3 years ago

Hi @Justin1609, the problem here is a dependency of the wgsim tool that requires an older version of the openssl library. To resolve this for the moment, until we fix it, install the older library manually inside your conda environment by running the following command.

## inside your conda env
conda install https://anaconda.org/conda-forge/openssl/1.0.2u/download/linux-64/openssl-1.0.2u-h516909a_0.tar.bz2

Let me know if this fixes the problem. Thanks
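
One way to see whether a binary such as wgsim still has unresolved shared libraries is to ask the dynamic loader directly. This is a rough sketch, assuming a Linux system where ldd is available; the function name missing_libs is invented for illustration and is not part of Crossmapper or wgsim.

```shell
# Hypothetical helper: report shared libraries that a binary needs but the loader cannot find.
missing_libs() {
    if [ ! -x "$1" ]; then
        echo "not an executable: $1" >&2
        return 2
    fi
    # ldd marks unresolved libraries with "not found"
    missing=$(ldd "$1" 2>/dev/null | grep 'not found' || true)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    echo "all shared libraries resolved for $1"
}
```

Inside the activated conda env, this could be called as missing_libs "$(command -v wgsim)". Any line it prints names a library that the installed packages do not provide.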

Related to #2

Justin1609 commented 3 years ago

Hi there Ahmed

Thanks, your solution worked great. However, I did not have enough disk space to run the analysis on my home computer and therefore I tried to run it on a remote host, where I am now having the same error as before. I have spoken with the server administrator, who has followed all of the steps that we have performed here, but it is still giving issues. If you have any idea why this would be the case, I would greatly appreciate your input.

Kind regards

J

GrantHov commented 3 years ago

Hi Justin,

What are the issues that you get on the remote host?

Meanwhile, I'd recommend trying to run Crossmapper locally, specifying fewer reads, say 5-10 million, and specifying only the read length that you have in your real data. You can also check the -C option (i.e. coverage): with this option Crossmapper will generate as many reads as needed to reach the coverage value C that you specify.
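
To get a feeling for how a coverage value relates to a number of reads, the usual definition coverage = reads × read length / target size can be rearranged. The sketch below assumes single-end reads and that definition; whether Crossmapper's -C option computes the number of reads in exactly this way is not stated in this thread, and the numbers are illustrative only.

```shell
# Reads needed for a target coverage, using coverage = reads * read_length / target_size.
# Rounds up so that the requested coverage is reached.
reads_for_coverage() {
    target_size=$1   # total bases of the sequence being covered
    read_len=$2      # read length in bases
    coverage=$3      # desired fold coverage
    echo $(( (target_size * coverage + read_len - 1) / read_len ))
}

# Illustrative numbers only: a 12,000,000 bp target, 75 bp reads, 30x coverage.
reads_for_coverage 12000000 75 30   # prints 4800000
```

Doubling the coverage or halving the read length doubles the number of reads, which is worth keeping in mind when disk space is limited.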

Justin1609 commented 3 years ago

Hi Grant,

I am getting the same issue I used to have in the VM before I ran the command Ahmed mentioned before. The system admin I am talking to suggested this:

"I would explicitly add the path to my path statement to be sure that the environment is picking these things up correctly.

with an export PATH=whatever/:$PATH"

And he also changed the ownership of the files in the virtual environment.

He has also run the command that Ahmed mentioned previously https://github.com/Gabaldonlab/crossmapper/issues/3#issuecomment-876968081

I am still getting the following issues that I have attached though.

cmd_wgsim_stderr.txt cat_stderr.txt

Any suggestions on how to fix this would be much appreciated.

Also thank you for the advice with the "-C" option, but how would I determine the read length of each read?

GrantHov commented 3 years ago

Thanks we will look into that.

Regarding the -C option, you were saying that you already have the sequencing data. If so, you can check the read length of your samples by running, for example, fastqc: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
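
If fastqc is not at hand, the read lengths can also be tallied directly from the file. This is a minimal sketch that assumes an uncompressed fastq file with exactly four lines per record; the function name read_length_counts is made up for illustration.

```shell
# Hypothetical helper: print "length count" pairs for the reads in a fastq file.
read_length_counts() {
    # In a 4-line fastq record, the sequence is the 2nd line of each record
    awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' "$1" | sort -n
}
```

For example, read_length_counts sample.fastq (file name hypothetical) prints one line per distinct read length. A single line means all reads have the same length, which is then the value to pass to -rlen.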

Justin1609 commented 3 years ago

Hi Grant

Alright great, thank you very much I truly appreciate it.

Kind regards

Justin

ahmedihafez commented 3 years ago

Hi @Justin1609, the manual solution of downloading the library was specific to your case, based on your OS and conda env; it could be different on the remote host. If you do not mind, run the following commands again on the remote host within your conda env and send us the results.

## as before run
conda list --explicit > bio-env.txt
## and this one too
conda env export > bio-env.yaml

After that, run the following command and let conda resolve and update the openssl library:

conda install -c conda-forge openssl
## then again run
conda list --explicit > bio-env-update.txt
## and this one too
conda env export > bio-env-update.yaml

You may also test after the update whether wgsim works by running

wgsim -h
## and let me know the error if it did not work

Justin1609 commented 3 years ago

Hi Ahmed

Thank you for your assistance, I truly appreciate it a lot. The system administrator of the remote server I use managed to solve the issue and the analysis ran to completion. It had to do with specifying certain PATH variables. Again thank you for all of your assistance in this matter.

Kind regards

Justin

Justin1609 commented 3 years ago

Good morning

Sorry to bother you yet again, I am trying to run the first example (see below) of crossmapper on certain species. However, I am getting an error as I am attempting to run the following command on a remote host:

crossmapper RNA -g Yeast_A.fasta Yeast_B.fasta Yeast_C.fasta -a Yeast_A.gff Yeast_B.gff Yeast_C.gff -gn Yeast_A Yeast_B Yeast_C -r 0.01 -gb -C 360 -rlen 75,100 -rlay both -o output_folder -t 10 -max_mismatch 5

The error I am getting is from STAR as it requires a FIFO TMPs directory, but I am working on a linux server. I attached the error file, but I am not sure how to simply change the STAR settings to redirect the output. The server administrator said I had to do the following:

"The HPC is not a windows filing system so you have to use --outTmpDir=/ choose a location"

Could you perhaps explain how I can go about editing the STAR template so that I can specify the output directory, which will hopefully fix the error?

STAR_mapping_stderr.txt

GrantHov commented 3 years ago

Hi, for that specific issue we implemented the -star_tmp option (see the help page with crossmapper RNA -help), which lets you specify a temporary directory for STAR files.

Justin1609 commented 3 years ago

Hi Grant

Thanks I appreciate it, will check it out now.

Justin1609 commented 3 years ago

Hi there Hrant

Truly apologise for bothering you again. However, I tried the -star_tmp option but I am now getting an error which I have attached. I asked my server administrator about it, his response was:

STAR_mapping_stderr.txt

"It wants to create the directory itself, else it fails."

(ii) if you specified --outTmpDir, and this directory exists - please remove it before running STAR

However, I am not sure then how to allow crossmapper to make its own directory, as -star_tmp requires a PATH to be specified or should I then just use -star_tmp like that without specifying a PATH? It seems by giving it a directory that causes the program to fail.

GrantHov commented 3 years ago

Hi, remove that temp directory, then create again, then try

STAR --runThreadN 10 --genomeDir /home/jasmus/Cleo_Raw_data/Crossmapper_RNA_reads_out/STAR_index --sjdbGTFfile /home/jasmus/Cleo_Raw_data/Crossmapper_RNA_reads_out/concat.gtf --sjdbOverhang 74 --readFilesIn /home/jasmus/Cleo_Raw_data/Crossmapper_RNA_reads_out/wgsim_output/concat_75_read1.fastq --readFilesCommand cat --outSAMtype BAM Unsorted --outFileNamePrefix /home/jasmus/Cleo_Raw_data/Crossmapper_RNA_reads_out/STAR_output/concat_75SE --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0 --outTmpDir /home/jasmus/Cleo_Raw_data/Crossmapper_RNA_reads_out/tmps

Justin1609 commented 3 years ago

Hi Hrant

Do I need to create a custom mapper template with what you have suggested in the previous response? And if so how would I go about doing that?

I checked the online help guide but not sure exactly how to go about creating the template? Would it just start with crossmapper --mapper-template and then enter the above code, or do you have to create it using another process? Also where would you then store the custom mapper template?

Justin1609 commented 3 years ago

Hi Hrant

Could you send me the code you suggested as a .yaml file, please? I would like to try specifying it as the --mapper-template and see if that solves the issue. I am assuming the --mapper-template file (e.g. STAR.yaml) would then have to be in the miniconda library and I would have to use export PATH=/apps/miniconda/lib:$PATH to specify the PATH to the library before activating the venv and running crossmapper again?

GrantHov commented 3 years ago

Hi, sorry, my bad: I replied to you with a STAR command instead of the crossmapper command.

Please try the following:

Create a directory in your home directory:
mkdir ~/TMP
Run crossmapper with the -star_tmp option pointing to ~/TMP/tmps (a directory that does not exist):
crossmapper RNA -g Yeast_A.fasta Yeast_B.fasta Yeast_C.fasta -a Yeast_A.gff Yeast_B.gff Yeast_C.gff -gn Yeast_A Yeast_B Yeast_C -r 0.01 -gb -C 360 -rlen 75,100 -rlay both -o output_folder -t 10 -max_mismatch 5 -star_tmp ~/TMP/tmps

Let me know if you still get an error
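
The two conditions above (the parent directory exists, the temporary directory itself does not) can be wrapped in a small helper so that they hold before every run. This is only a sketch with a made-up function name, prepare_star_tmp. Note that it deletes the target directory if it is left over from a previous attempt, so it should only ever be pointed at a scratch location.

```shell
# Hypothetical helper: make sure the parent of the STAR temp dir exists and the
# temp dir itself does not, then print the path so it can be passed on.
prepare_star_tmp() {
    tmp=$1
    # Refuse to run on an empty argument, since rm -rf follows
    [ -n "$tmp" ] || return 1
    mkdir -p "$(dirname "$tmp")" || return 1
    # STAR asks for an existing --outTmpDir to be removed before it runs
    rm -rf "$tmp"
    printf '%s\n' "$tmp"
}
```

It could then be used as -star_tmp "$(prepare_star_tmp ~/TMP/tmps)" at the end of the crossmapper command shown above.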

Justin1609 commented 3 years ago

Hi Hrant

It is still giving the same error. I have pasted the errors from the terminal as well as the error file. Any suggestions would be greatly appreciated. STAR_mapping_stderr.txt

[2021-07-28 18:59:05] ERROR : See error log in /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_output/STAR_mapping_stderr.txt
Traceback (most recent call last):
  File "/apps/miniconda/envs/py36/bin/crossmapper", line 11, in <module>
    load_entry_point('crossmapper==1.1.1', 'console_scripts', 'crossmapper')()
  File "/apps/miniconda/envs/py36/lib/python3.6/site-packages/crossmapper/crossmapper.py", line 469, in crossmapMain
    parsedArgs.mapper.run()
  File "/apps/miniconda/envs/py36/lib/python3.6/site-packages/crossmapper/mapper.py", line 340, in run
    self.runMapping()
  File "/apps/miniconda/envs/py36/lib/python3.6/site-packages/crossmapper/mapper.py", line 396, in runMapping
    tmpFile = self.doMap(self.parsedArgs.simulationOutputFiles[rlen],rlen,layout)
  File "/apps/miniconda/envs/py36/lib/python3.6/site-packages/crossmapper/mapper.py", line 504, in doMap
    raise ExecException("STAR Execution Failed.")
crossmapper.mapper.ExecException: STAR Execution Failed.

Justin1609 commented 3 years ago

Hi Hrant,

Should I try running crossmapper on a VM with a linux OS and see if it will work there? I can't think of why it is still giving that error when I run it on the remote server.

GrantHov commented 3 years ago

Hi there,

One question: is the temp directory that you specify still on the remote server, or on your local machine (for example via a samba connection to the remote)? If you try it and it works on a local machine, then there is probably an issue with permissions, or some other more cryptic issue with STAR. In any case it seems to be a problem with STAR (not Crossmapper), so if it still persists I'd recommend posting the issue to the STAR github https://github.com/alexdobin/STAR

Justin1609 commented 3 years ago

Hi @GrantHov and @ahmedihafez

The suggestion from @alexdobin was as follows:

Hi Justin,

this error is different - it cannot create FIFO files in the TMP directories. Please confirm it by running
mkfifo /home/jasmus/TMP/tmps/aaa
If it fails, it means that the partition is not one of the standard *nix partitions - is it FAT or NTFS? Then you need to point --outTmpDir to a Linux partition that supports FIFO files.

Alternatively, since you are using unzipped FASTQ files, you can omit --readFilesCommand cat from the STAR command line - then it will not use FIFO files.

Cheers Alex

The remote server I work on is not a windows file system. So I am not sure if redirecting --outTmpDir will work. Could you please explain how I can remove the --readFilesCommand cat from the STAR settings when I run crossmapper? Would I need to create a custom mapper template and if so how would I do this, as I am not familiar with .yaml coding?

ahmedihafez commented 3 years ago

Hi @Justin1609, many thanks for your feedback. First, could you please let us know whether executing mkfifo /home/jasmus/TMP/tmps/aaa works or not?

Also, to see if removing --readFilesCommand cat will work, you can try the following command (given you still have the previous output from Crossmapper).

## First make sure you remove tmps folder
rm -rf ~/TMP/tmps/
## then run STAR test 
STAR --runThreadN 10 --genomeDir /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_index --sjdbGTFfile /home/jasmus/Cleo_Raw_data/crossmap_out/concat.gtf --sjdbOverhang 74 --readFilesIn /home/jasmus/Cleo_Raw_data/crossmap_out/wgsim_output/concat_75_read1.fastq  --outSAMtype BAM Unsorted --outFileNamePrefix /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_output/concat_75_SE_ --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0 --outTmpDir /home/jasmus/TMP/tmps

If it works then you can use a custom STAR mapper template, which we can provide for you.

Thanks

Justin1609 commented 3 years ago

Hi there @ahmedihafez

Thank you for the reply. I am not sure if I can generate a FIFO file on the remote server, as it is not a windows file system. I ran the command you sent, but I am still getting the same error.

STAR --runThreadN 10 --genomeDir /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_index --sjdbGTFfile /home/jasmus/Cleo_Raw_data/crossmap_out/concat.gtf --sjdbOverhang 74 --readFilesIn /home/jasmus/Cleo_Raw_data/crossmap_out/wgsim_output/concat_75_read1.fastq --outSAMtype BAM Unsorted --outFileNamePrefix /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_output/concat_75SE --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0 --outTmpDir /scratch2/TMP/tmps

EXITING because of fatal ERROR: could not make temporary directory: /scratch2/TMP/tmps/
SOLUTION: (i) please check the path and writing permissions
(ii) if you specified --outTmpDir, and this directory exists - please remove it before running STAR

Aug 06 12:22:24 ...... FATAL ERROR, exiting

Any advice would be greatly appreciated, and thank you for the help thus far I truly appreciate it.

Kind regards

Justin

ahmedihafez commented 3 years ago

mkfifo is a Linux command, so you should be able to run it on the remote server. Running the following commands in your home dir on the remote server will confirm that for us.

## first make sure that the tmp dir is there
mkdir -p ~/TMP/tmps
mkfifo ~/TMP/tmps/aaa
## list the files in ~/TMP/tmps/ to see if the aaa file was created
ls ~/TMP/tmps/
## then finally remove the ~/TMP/tmps/ dir again for STAR to work
rm -rf ~/TMP/tmps/

And let us know if the file aaa was created or not.

As for the STAR command, you have changed the tmp dir again from --outTmpDir /home/jasmus/TMP/tmps to --outTmpDir /scratch2/TMP/tmps, so please revise it or just keep it as --outTmpDir ~/TMP/tmps. It should look like this:

## first make sure that the parent tmp dir is there, if not yet
mkdir -p ~/TMP/
## make sure you remove the tmps folder if it already exists
rm -rf ~/TMP/tmps/
## then run STAR test
STAR --runThreadN 10 --genomeDir /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_index --sjdbGTFfile /home/jasmus/Cleo_Raw_data/crossmap_out/concat.gtf --sjdbOverhang 74 --readFilesIn /home/jasmus/Cleo_Raw_data/crossmap_out/wgsim_output/concat_75_read1.fastq  --outSAMtype BAM Unsorted --outFileNamePrefix /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_output/concat_75_SE_ --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0 --outTmpDir ~/TMP/tmps

It is important to keep --outTmpDir ~/TMP/tmps to see if it works or not.
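
The FIFO test described earlier in this comment can also be written as a reusable check that cleans up after itself, which makes it easy to try several candidate locations (home directory, scratch space, local disk). This is a sketch only; supports_fifo is a made-up name and the probe file name is arbitrary.

```shell
# Hypothetical helper: succeeds if a FIFO (named pipe) can be created in the given directory.
supports_fifo() {
    dir=$1
    # Use the shell's process ID to make the probe name unlikely to clash
    probe="$dir/.fifo_probe_$$"
    if mkfifo "$probe" 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}
```

For example, supports_fifo ~/TMP && echo "FIFO ok" prints the message only where named pipes can be created. The directory must already exist, so a failure can also mean a missing directory or missing write permission.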

Justin1609 commented 3 years ago

Hi @ahmedihafez

Apologies, I had to get help from my system admin as it was an issue with the remote server that I use to run crossmapper. The issue was that the home directory I was attempting to create the fifo file in is a gluster mount so creating special files over the gluster mount was where it was failing. I have since attempted to run it on a local file system, but I am now having an issue with STAR, which I have raised with Alex Dobin here https://github.com/alexdobin/STAR/issues/1323#issue-964913187

I actually wanted to ask what version of STAR crossmapper uses? I think I may need to recreate the reference genome with the same version of STAR to solve this issue. If I do need to do that though, how would I go about recreating the reference genome so that it can complete the remainder of the crossmapper output from the wgsim files that have already been created?

Justin1609 commented 3 years ago

Hi there @ahmedihafez and @GrantHov

I managed to get STAR to successfully complete the run using these parameters:

STAR --runThreadN 10 --genomeDir /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_index --sjdbGTFfile /home/jasmus/Cleo_Raw_data/crossmap_out/concat.gtf --sjdbOverhang 74 --readFilesIn /home/jasmus/Cleo_Raw_data/crossmap_out/wgsim_output/concat_75_read1.fastq --outSAMtype BAM Unsorted --outFileNamePrefix /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_output/concat_75SE --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0
STAR version: 2.7.9a_2021-06-25 compiled: 2021-06-25T15:53:52-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 17 11:43:05 ..... started STAR run
Aug 17 11:43:05 ..... loading genome
Aug 17 11:43:09 ..... processing annotations GTF
Aug 17 11:43:10 ..... inserting junctions into the genome indices
Aug 17 11:43:12 ..... started mapping
Aug 17 14:49:11 ..... finished mapping
Aug 17 14:49:15 ..... finished successfully

My question now is, how do I get the crossmapper output now for the analysis? Any help would be greatly appreciated. Look forward to hearing from you.

Justin1609 commented 3 years ago

@ahmedihafez and @GrantHov should I maybe just run the crossmapper command again now that the STAR alignment was successful and will this then generate the crossmapper output file?

Also I have wgsim files for read lengths of 100 bp as well, so should I first run STAR again for those reads using the following command?

STAR --runThreadN 10 --genomeDir /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_index --sjdbGTFfile /home/jasmus/Cleo_Raw_data/crossmap_out/concat.gtf --sjdbOverhang 99 --readFilesIn /home/jasmus/Cleo_Raw_data/crossmap_out/wgsim_output/concat_100_read1.fastq --outSAMtype BAM Unsorted --outFileNamePrefix /home/jasmus/Cleo_Raw_data/crossmap_out/STAR_output/concat_100SE --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0

GrantHov commented 3 years ago

Hi Justin, good to see that STAR finally worked for you. I'd recommend now just running crossmapper again (in theory, if STAR works then crossmapper should too). You can first run a toy example of crossmapper (specifying, say, just 1000 reads to simulate) to check whether it works.

Justin1609 commented 3 years ago

Hi @GrantHov alright great thank you very much, I will give that a try now

Justin1609 commented 3 years ago

Hi there @ahmedihafez and @GrantHov

I tried crossmapper again and it ran for quite a while, but now I got a new error again involving STAR

[jasmus@n05 /scratch/crossmap_out/STAR_output]$ cat STAR_mapping_stderr.txt
#########################
STAR --runThreadN 10 --genomeDir /scratch/crossmap_out/STAR_index --sjdbGTFfile /scratch/crossmap_out/concat.gtf --sjdbOverhang 74 --readFilesIn /scratch/crossmap_out/wgsim_output/concat_75_read1.fastq --readFilesCommand cat --outSAMtype BAM Unsorted --outFileNamePrefix /scratch/crossmap_out/STAR_output/concat_75SE --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0 --outTmpDir ./TMPs
#########################
#########################
STAR --runThreadN 10 --genomeDir /scratch/crossmap_out/STAR_index --sjdbGTFfile /scratch/crossmap_out/concat.gtf --sjdbOverhang 74 --readFilesIn /scratch/crossmap_out/wgsim_output/concat_75_read1.fastq /scratch/crossmap_out/wgsim_output/concat_75_read2.fastq --readFilesCommand cat --outSAMtype BAM Unsorted --outFileNamePrefix /scratch/crossmap_out/STAR_output/concat_75PE --outFilterMismatchNmax 5 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0 --outTmpDir ./TMPs
#########################

EXITING because of fatal error: buffer size for SJ output is too small
Solution: increase input parameter --limitOutSJcollapsed

Aug 20 00:45:14 ...... FATAL ERROR, exiting

Could you please advise on how I can resolve this new error? Look forward to hearing from you soon.

Justin1609 commented 3 years ago

Hi @GrantHov and @ahmedihafez

I have found a similar error reported for STAR before on the git page for STAR, here is the link https://github.com/alexdobin/STAR/issues/778#issue-522051696

I think I will need a custom STAR mapper template for the analysis I am trying to run, unless you could explain how I can run crossmapper with the edits for STAR that I would need to make?

alexdobin commented 3 years ago

Hi Justin,

as the error message states, you need to increase --limitOutSJcollapsed.

Cheers Alex

Justin1609 commented 3 years ago

Hi @alexdobin

Thank you. Yes, I would like to try changing that parameter, but I am not familiar with YAML, so I am hoping that @ahmedihafez or @GrantHov could provide me with a mapping template to use with crossmapper so that it runs to completion.

GrantHov commented 3 years ago

Hi,

Below are the contents of a template file:

type: RNA
dep: bioconda/star
mapper_name : STAR
output_type : bam
sorted : no
outputfile_pattern: "{outputfile_prefix}_Aligned.out.bam"
template:
    index :
        "STAR --runMode genomeGenerate --runThreadN {n_threads} --genomeDir {ref_dir} --genomeFastaFiles {ref_fasta} --genomeSAindexNbases 11"
    both :
        "rm -rf {tmp_dir}"
        "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2} --readFilesCommand cat --outFileNamePrefix {output_dir}/{outputfile_prefix} --outSAMtype BAM Unsorted --outTmpDir {tmp_dir} --sjdbOverhang 49 --outFilterMismatchNmax 100 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0"

Copy and paste it into a file named template.txt. In the last line, after --outTmpDir {tmp_dir}, you can add or remove whatever STAR options you need (but don't change the options that come before it). Also change the value of --genomeSAindexNbases 11 under "index" if needed.

When ready, run crossmapper as below:

crossmapper RNA -g fasta1.fa fasta2.fa -a annot1.gff annot2.gff -gn name1 name2 -gb -N 1000 1000 -rlen 50 -rlay both -o output_folder -t 2 --mapper-template template.txt

If this test run finishes successfully, you can increase the number of reads and change other parameters. Note that any change to a STAR option has to be made in the template. Hope this helps.
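As a rough illustration of the advice above, an extra STAR option can also be added to the mapping command with a few lines of Python instead of by hand. This is only a sketch and not part of Crossmapper: the function name `add_star_option` is made up for this example, the mapping line in the demo is shortened, and the only thing it assumes is that the mapping command contains the literal text `--outTmpDir {tmp_dir}`, as in the template above.

```python
# Literal text in the mapping command after which extra STAR options go.
ANCHOR = "--outTmpDir {tmp_dir}"


def add_star_option(template_text, option):
    """Insert `option` right after '--outTmpDir {tmp_dir}' on each line that has it.

    Lines without the anchor (for example the genomeGenerate command) are
    left untouched, and a line that already contains the option is not
    changed a second time.
    """
    out_lines = []
    for line in template_text.splitlines():
        if ANCHOR in line and option not in line:
            line = line.replace(ANCHOR, ANCHOR + " " + option, 1)
        out_lines.append(line)
    return "\n".join(out_lines)


# Shortened mapping line, for demonstration only.
mapping_line = (
    '"STAR --genomeDir {ref_dir} --readFilesIn {1} {2} '
    '--outTmpDir {tmp_dir} --sjdbOverhang 49"'
)
print(add_star_option(mapping_line, "--limitOutSJcollapsed 10000000"))
```

The value 10000000 is just an example; the right number depends on the data. Whatever is added this way still has to end up in the template file passed to --mapper-template.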

Justin1609 commented 3 years ago

Hi there @GrantHov

Thank you very much, I appreciate it. Just to make sure I understand correctly: I can create template.txt with the nano text editor and then copy and paste the contents above into it? And I just need to add the fix for the error I was getting previously right after --outTmpDir {tmp_dir}, so it would look something like this?

type: RNA
dep: bioconda/star
mapper_name : STAR
output_type : bam
sorted : no
outputfile_pattern: "{outputfile_prefix}Aligned.out.bam"
template:
    index :
        "STAR --runMode genomeGenerate --runThreadN {n_threads} --genomeDir {ref_dir} --genomeFastaFiles {ref_fasta} --genomeSAindexNbases 11"
    both :
        "rm -rf {tmp_dir}"
        "STAR --genomeDir {ref_dir} --sjdbGTFfile {gtf} --readFilesIn {1} {2} --readFilesCommand cat --outFileNamePrefix {output_dir}/{outputfile_prefix} --outSAMtype BAM Unsorted --outTmpDir {tmp_dir} --limitOutSJcollapsed 10000000 --sjdbOverhang 49 --outFilterMismatchNmax 100 --outFilterMultimapNmax 10000 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMax 0"

GrantHov commented 3 years ago

Yes, that's correct. One minor typo though: the outputfile_pattern line should be

outputfile_pattern: "{outputfile_prefix}_Aligned.out.bam"

(i.e. there is an _ between {outputfile_prefix} and Aligned.out.bam)

Justin1609 commented 3 years ago

@GrantHov okay thanks so much, I really appreciate it a lot.

Justin1609 commented 3 years ago

Hi @GrantHov

I tried out the solution you provided; however, I am getting the following error:

[jasmus@n05 ~]$ cd /scratch
[jasmus@n05 /scratch]$ export PATH=/apps/miniconda/bin:/apps/miniconda/envs:$PATH
[jasmus@n05 /scratch]$ export LD_LIBRARY_PATH=/apps/miniconda/lib:$LD_LIBRARY_PATH
[jasmus@n05 /scratch]$ export PATH=/apps/miniconda/lib:$PATH
[jasmus@n05 /scratch]$ source activate py36
(py36) [jasmus@n05 /scratch]$ crossmapper RNA -g S_cerevisiae.fna T_delbrueckii.fna L_thermotolerans.fna -a S_genomic.gff T_genomic.gff L_genomic.gff -gn S_cerevisiae T_delbrueckii L_thermotolerans -gb -N 1000 1000 1000 -rlen 75 -rlay both -o output_folder -t 2 --mapper-template template.txt -rc
crossmapMain crossmapper
[2021-08-23 16:45:33] INFO : Starting the program with "/apps/miniconda/envs/py36/bin/crossmapper RNA -g S_cerevisiae.fna T_delbrueckii.fna L_thermotolerans.fna -a S_genomic.gff T_genomic.gff L_genomic.gff -gn S_cerevisiae T_delbrueckii L_thermotolerans -gb -N 1000 1000 1000 -rlen 75 -rlay both -o output_folder -t 2 --mapper-template template.txt -rc"
[2021-08-23 16:45:33] ERROR : Error Can not use Custom Mapper, argument of type 'NoneType' is not iterable
Error Can not use Custom Mapper.

GrantHov commented 3 years ago

This can happen because of formatting issues in the template file (probably introduced by copying and pasting from GitHub). I can send you the template if you give me your email.
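One quick way to look for this kind of copy-paste damage is to scan the template text for characters that tend to sneak in when text is copied from a web page. The snippet below is only a sketch, not a Crossmapper feature: the function name `find_template_problems` and the list of checks (tabs, non-breaking spaces, curly quotes, unbalanced braces) are assumptions about what might have gone wrong, and a clean result does not guarantee that the template is valid.

```python
# Characters that often appear after copying from a rendered web page.
SUSPECT_CHARS = {
    "\t": "tab character (YAML indentation must use spaces)",
    "\u00a0": "non-breaking space",
    "\u201c": "curly opening double quote",
    "\u201d": "curly closing double quote",
    "\u2018": "curly opening single quote",
    "\u2019": "curly closing single quote",
}


def find_template_problems(text):
    """Return a list of (line_number, description) for suspicious lines."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        for char, description in SUSPECT_CHARS.items():
            if char in line:
                problems.append((number, description))
        # Placeholders such as {tmp_dir} need matching braces on the same line.
        if line.count("{") != line.count("}"):
            problems.append((number, "unbalanced curly braces"))
    return problems


# Small in-memory demo with a tab and a mistyped placeholder.
sample = 'both :\n\t"rm -rf [tmp_dir}"'
for number, description in find_template_problems(sample):
    print("line {}: {}".format(number, description))
```

To check a real file, read template.txt as UTF-8 text and pass its contents to the function.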

Justin1609 commented 3 years ago

Hi @GrantHov

My email is justinasmus16@gmail.com, and again, thank you for your help, I truly appreciate it. So, when I get the file from you, I can just upload it to the remote server, put it in the scratch directory, and run crossmapper again as you suggested?
