Generalise the process episodes

carpentries-incubator / workflows-nextflow

Workflow management with Nextflow and nf-core

https://carpentries-incubator.github.io/workflows-nextflow/


Generalise the process episodes #71

Closed ggrimes closed 7 months ago

ggrimes commented 2 years ago

Change the process episodes to remove the RNA-Seq-specific examples and use more general ones. These new examples should only use basic UNIX commands, such as those covered in the https://swcarpentry.github.io/shell-novice lesson.

Some examples of useful Bash commands for handling FASTA files can be found at https://www.biostars.org/p/17680

ggrimes commented 2 years ago

As an example

Count the number of sequences in a FASTA file

zgrep -c "^>" data/yeast/transcriptome/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz
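To try the idea without downloading the yeast transcriptome, here is a minimal sketch on a made-up file. The file name `toy.fa` and its three records are assumptions for illustration only, and `gzip -dc ... | grep` stands in for `zgrep`, which should give the same count here.

```shell
# Build a tiny, made-up gzipped FASTA file (3 records) in a temporary directory.
tmpdir=$(mktemp -d)
printf '>seq1 first record\nATGC\nTTAA\n>seq2 second record\nGGCC\n>seq3 third record\nATAT\n' > "$tmpdir/toy.fa"
gzip "$tmpdir/toy.fa"   # produces toy.fa.gz

# Every FASTA record starts with a ">" header line,
# so counting those lines counts the records.
n_seqs=$(gzip -dc "$tmpdir/toy.fa.gz" | grep -c '^>')
echo "$n_seqs"
```

On this toy file the count is 3, one per header line.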

Then count the number of T bases in the FASTA file (strictly, grep -c counts the lines that contain a T, not every T)

zgrep -v "^>" ./nextflow_rnaseq_training_dataset/data/yeast/transcriptome/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz | grep -c T
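If the lesson wants every base occurrence rather than matching lines, one possible variant is `grep -o`, which prints each match on its own line so that `wc -l` can count them. The sketch below uses made-up sequence lines and is only a suggestion; it does need one more pipe than the original command.

```shell
# Made-up sequence lines, with header lines already removed.
tmpdir=$(mktemp -d)
printf 'ATTT\nGGCC\nTATA\n' > "$tmpdir/seqs.txt"

# grep -c counts LINES that contain at least one T.
t_lines=$(grep -c T "$tmpdir/seqs.txt")

# grep -o prints each T on its own line; wc -l then counts every occurrence.
# tr removes the padding that some wc implementations add.
t_bases=$(grep -o T "$tmpdir/seqs.txt" | wc -l | tr -d ' ')

echo "lines with T: $t_lines, T bases: $t_bases"
```

Here two lines contain a T, while the file holds five T bases in total.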

Use a queue channel with A,T,G,C to count all bases

zgrep -v "^>" data/yeast/transcriptome/Saccharomyces_cerevisiae.R64-1-1.cdna.all.fa.gz|grep -c ${base}
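Outside Nextflow, the effect of sending A, T, G and C through a channel can be approximated with a plain shell loop, which may help learners see what `${base}` is replaced with. This is only a sketch on a made-up file (`toy.fa.gz` is an assumption); in the workflow each value would become its own task instead of a loop iteration, and `gzip -dc ... | grep` again stands in for `zgrep`.

```shell
# Tiny made-up gzipped FASTA file.
tmpdir=$(mktemp -d)
printf '>seq1\nATGC\nTTAA\n>seq2\nGGGG\n' | gzip > "$tmpdir/toy.fa.gz"

counts=""
for base in A T G C; do
    # Drop header lines, then count the lines that contain this base.
    n=$(gzip -dc "$tmpdir/toy.fa.gz" | grep -v '^>' | grep -c "$base")
    counts="$counts$base=$n "
done
echo "$counts"
```

As with the original command, `grep -c` reports the number of sequence lines containing each base.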

For the combining channels episode, this could be counting the number of A, T, G and C bases within each sequence in the FASTA file
```
nextflow.enable.dsl=2
params.fasta=""

process COUNT {

    input:
    each nt          // repeat the task for every base
    path sequence    // one single-record FASTA file

    script:
    // \\w is escaped so the shell receives \w
    """
    grep -o -E '^>\\w+' ${sequence} | tr -d '>' | tr '\n' '\t'
    printf $nt
    cat ${sequence} | grep -v '^>' | grep -c ${nt}
    """
}

// split the FASTA into one file per record and keep the first 10
ch_seq = Channel
    .fromPath(params.fasta)
    .splitFasta(by: 1, file: true)
    .take(10)

ch_base = Channel.of('A','T','G','C')

workflow {
    COUNT(ch_base, ch_seq)
}
```

mahesh-panchal commented 2 years ago

Wouldn't this be considered complex for a novice?

  grep -o -E '^>\\w+' ${sequence}| tr -d '>'| tr  '\n' '\t'
  printf $nt
  cat ${sequence}|grep -v '^>' |grep -c ${nt}

ggrimes commented 2 years ago

Yes, it requires more than is described in the Carpentries introduction to Unix. Maybe there is an easier way to do this.

grep ">" ${sequence} |cut -f1 -d " "|tr -d ">"

mahesh-panchal commented 2 years ago

For sequence headers, I'll usually use

grep ">" ${sequence} | cut -c2-

but that still leaves everything after the space.

On topic, though, I still think we should minimize piping and have at most two pipes, with no regular expressions if possible.
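As a rough illustration of how the two header commands above differ, and of one way to stay within two pipes without a regular expression, here is a sketch on a made-up file. The file name and headers are assumptions for illustration.

```shell
# Made-up FASTA file whose headers carry a description after the ID.
tmpdir=$(mktemp -d)
printf '>seq1 some description\nATGC\n>seq2 another description\nGGCC\n' > "$tmpdir/toy.fa"

# Strip only the leading ">": the description after the space is kept.
full=$(grep ">" "$tmpdir/toy.fa" | cut -c2-)

# Keep the first space-separated field, then strip the ">": IDs only.
ids=$(grep ">" "$tmpdir/toy.fa" | cut -d " " -f1 | cut -c2-)

echo "$full"
echo "$ids"
```

The second command uses two pipes and only `grep` and `cut`, both of which appear in the shell-novice lesson.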

ggrimes commented 2 years ago

https://www.nextflow.io/docs/latest/operator.html#splitfasta

Do you think using the splitFasta operator would be too much?


Channel
     .fromPath('data/yeast/reads/transcriptome/*')
     .splitFasta( record: [id: true, seqString: true ])

mahesh-panchal commented 2 years ago

Depends where one is in the episodes. Once you've covered operators, it should be fine.